Abstract

In this paper we introduce a fundamental principle for optimal communication over general memoryless channels
in the presence of noiseless feedback, termed posterior matching. Using this principle, we devise a (simple,
sequential) generic feedback transmission scheme suitable for a large class of memoryless channels and input
distributions, achieving any rate below the corresponding mutual information. This provides a unified framework for
optimal feedback communication in which the Horstein scheme (BSC) and the Schalkwijk-Kailath scheme (AWGN
channel) are special cases. Thus, as a corollary, we prove that the Horstein scheme indeed attains the BSC capacity,
settling a longstanding conjecture. We further provide closed form expressions for the error probability of the
scheme over a range of rates, and derive the achievable rates in a mismatch setting where the scheme is designed
according to the wrong channel model. Several illustrative examples of the posterior matching scheme for specific
channels are given, and the corresponding error probability expressions are evaluated. The proof techniques
employed utilize novel relations between information rates and contraction properties of iterated function systems.


I Introduction 


Feedback cannot increase the capacity of memoryless channels [1][2], but can significantly improve error probability
performance and, perhaps more importantly, can drastically simplify capacity achieving transmission schemes.
Whereas complex coding techniques strive to approach capacity in the absence of feedback, that same goal can
sometimes be attained using noiseless feedback via simple deterministic schemes that work "on the fly". Probably
the first elegant feedback scheme in that spirit is due to Horstein [3] for the Binary Symmetric Channel (BSC). In that
work, information is represented by a uniformly distributed message point over the unit interval, its binary expansion
representing an infinite random binary sequence. The message point is then conveyed to the receiver in an increasing
resolution by always indicating whether it lies to the left or to the right of its posterior distribution's median, which
is also available to the transmitter via feedback. Loosely speaking, using this strategy the transmitter always answers
the most informative binary question that can be posed by the receiver based on the information the latter has. Bits
from the binary representation of the message point are decoded by the receiver whenever their respective intervals
accumulate a sufficient posterior probability mass. The Horstein scheme was conjectured to achieve the capacity of
the BSC, but this claim was verified only for a discrete set of crossover probability values for which the medians
exhibit regular behavior [4][5], and has otherwise not been rigorously established hitherto.

A few years later, two landmark papers by Schalkwijk-Kailath [7] and Schalkwijk [8] presented an elegant
capacity achieving feedback scheme for the Additive White Gaussian Noise (AWGN) channel with an average
power constraint. The Schalkwijk-Kailath scheme is "parameter estimation" in spirit, and its simplest realization
is described as follows: Fixing a rate $R$ and a block length $n$, the unit interval is partitioned into $2^{nR}$ equal length
subintervals, and a (deterministic) message point is selected as one of the subintervals' midpoints. The transmitter
first sends the message point itself, which is corrupted by the additive Gaussian noise in the channel and so received
with some bias. The goal of the transmitter is now to refine the receiver's knowledge of that bias, thereby zooming-in
on the message point. This is achieved by computing the Minimum Mean Square Error (MMSE) estimate of the
bias given the output sequence observed thus far, and sending the error term amplified to match the permissible input
power constraint, on each channel use. At the end of transmission the receiver uses a nearest neighbor decoding rule
to recover the message point. This linear scheme is strikingly simple and yet achieves capacity; in fact at any rate
below capacity it has an error probability decaying double-exponentially with the block length, as opposed to the
single exponential attained by non-feedback schemes. A clean analysis of the Schalkwijk-Kailath scheme can be
found in [9] and a discussion of a sequential delay-universal variant is given in [10].

Since the emergence of the Horstein and the Schalkwijk-Kailath schemes, it was evident that these are similar 
in some fundamental sense. Both schemes use the message point representation, and both attempt to "steer" the 
receiver in the right direction by transmitting what is still missing in order to "get it right". However, neither the 
precise correspondence nor a generalization to other cases has ever been established. In this paper, we show that 
in fact there exists an underlying principle, which we term posterior matching, that connects these two schemes.
Applying this principle, we present a simple recursive feedback transmission scheme that can be tailored to any 
memoryless channel and any desired input distribution (e.g., capacity achieving under some input constraints), and is 
optimal in the sense of achieving the corresponding mutual information, under general conditions. Loosely speaking, 
the new scheme operates as follows: At each time instance, the transmitter computes the posterior distribution of 
the message point given the receiver's observations. According to the posterior, it "shapes" the message point into a 
random variable that is independent of the receiver's observations and has the desired input distribution, and transmits 
it over the channel. Intuitively, this random variable captures the information still missing at the receiver, described 
in a way that best matches the channel input. In the special cases of a BSC with uniform input distribution and an 
AWGN channel with a Gaussian input distribution, the posterior matching scheme is reduced to those of Horstein 
and Schalkwijk-Kailath respectively, thereby also proving the Horstein conjecture as a corollary. 

The paper is organized as follows. In Section II, notations and necessary mathematical background are provided.
In Section III, the posterior matching principle is introduced and the corresponding transmission scheme is derived.
Technical regularity conditions for channels and input distributions are discussed in Section IV. The main result
of this paper, the achievability of the mutual information via posterior matching, is presented in Section V. Error
probability analysis is addressed in Section VI, where closed-form expressions are provided for a range of rates
(sometimes strictly) below the mutual information. Some extensions including variants of the baseline scheme, and
the penalty in rate incurred by a channel model mismatch, are addressed in Section VII. A discussion and some
future research items appear in Section VIII. Several illustrative examples are discussed and revisited throughout the
paper, clarifying the ideas developed.

II Preliminaries 

In this section we provide some necessary mathematical background. Notations and definitions are given in
Subsection A. Information theoretic notions pertaining to the setting of communication with feedback are described in
Subsection B. An introduction to the main mathematical tools used in the paper, continuous state-space Markov
chains and iterated function systems, is given in Subsections C and D.

A Notations and Definitions 

Random variables (r.v.'s) are denoted by upper-case letters, their realizations by corresponding lower-case letters. A
real-valued r.v. $X$ is associated with a probability distribution $P_X(\cdot)$ defined on the usual Borel $\sigma$-algebra over $\mathbb{R}$, and
we write $X \sim P_X$. The cumulative distribution function (c.d.f.) of $X$ is given by $F_X(x) = P_X((-\infty, x])$, and the
inverse c.d.f. is defined by $F_X^{-1}(t) = \inf\{x : F_X(x) > t\}$. Unless otherwise stated, we assume that any real-valued
r.v. $X$ is either continuous, discrete, or a mixture of the two (footnote 2). Accordingly, $X$ admits a (wide sense) probability
density function (p.d.f.) $f_X(x)$, which can be written as a mixture of a Lebesgue integrable function (continuous part)
and Dirac delta functions (discrete part). If there is only a continuous part then $X$ and its distribution/c.d.f./p.d.f. are
called proper. The support of $X$ is the intersection of all closed sets $A$ for which $P_X(\mathbb{R} \setminus A) = 0$, and is denoted
$\mathrm{supp}(X)$ (footnote 3). For brevity, we write $P_X(x)$ for $P_X(\{x\})$, and $x \in \mathrm{supp}(X)$ is called a mass point if $P_X(x) > 0$.
The discrete part of the support is the set of all mass points, and the continuous part the complement set. The
interior of the support is denoted by supp(X) for short. A vector of real-valued r.v.'s $X^n = (X_1, X_2, \ldots, X_n)$ is
similarly associated with $P_{X^n}$, $F_{X^n}$, $f_{X^n}$ and with $\mathrm{supp}(X^n)$, where the p.d.f. is now called proper if all the scalar
conditional distributions are a.s. (almost surely) proper. We write $\mathbb{E}(\cdot)$ for expectation and $\mathbb{P}(\cdot)$ for the probability
of a measurable event within the parentheses. The uniform probability distribution over $(0,1)$ is denoted throughout
by $\mathcal{U}$. A measurable bijective function $\mu : (0,1) \mapsto (0,1)$ is called a uniformity preserving function (u.p.f.) if
$\Theta \sim \mathcal{U}$ implies that $\mu(\Theta) \sim \mathcal{U}$.

Footnote 2: This restricts $F_X$ to be the sum of an absolutely continuous function (continuous part) and a jump function
(discrete part). This is to say we avoid the case of a singular part, where $P_X$ assigns positive probability to some
uncountable set of zero Lebesgue measure.
Footnote 3: This coincides with the usual definitions of support for continuous and discrete r.v.'s.

A scalar distribution $P_X$ is said to be (strictly) dominated by another distribution $P_Y$ if $F_X(x) < F_Y(x)$
whenever $F_Y(x) \in (0,1)$, and the relation is denoted by $P_X \prec_d P_Y$. A distribution $P_X$ is called absolutely continuous
w.r.t. another distribution $P_Y$, if $P_Y(A) = 0$ implies $P_X(A) = 0$ for every $A \in \mathfrak{B}$, where $\mathfrak{B}$ is the corresponding
$\sigma$-algebra. This relation is denoted $P_X \ll P_Y$. If both distributions are absolutely continuous w.r.t. each other, then
they are said to be equivalent. The total variation distance between $P_X$ and $P_Y$ is defined as

$$d_{TV}(P_X, P_Y) = \sup_{A \in \mathfrak{B}} |P_X(A) - P_Y(A)|$$

A statement is said to be satisfied for $P_X$-a.a. (almost all) $x$, if the set of $x$'s for which it is satisfied has probability
one under $P_X$.

In what follows we use $\mathrm{conv}(\cdot)$ for the convex hull operator, $|A|$ for the length of an interval $A \subseteq \mathbb{R}$, $\log$ for
$\log_2$, $\mathrm{range}(f)$ for the range of a function $f$, and $\circ$ for function composition. The indicator function over a set $A$ is
denoted by $\mathbb{1}_A(\cdot)$. A set $A \subseteq \mathbb{R}^m$ is said to be convex in the direction $u \in \mathbb{R}^m$, if the intersection of $A$ with any line
parallel to $u$ is a connected set (possibly empty). Note that $A$ is convex if and only if it is convex in any direction.

The following simple lemma states that (up to discreteness issues) any real-valued r.v. can be shaped into a
uniform r.v. or vice versa, by applying the corresponding c.d.f. or its inverse, respectively. This fact is found very
useful in the sequel.

Lemma II.1. Let $X \sim P_X$, $\Theta \sim \mathcal{U}$ be statistically independent. Then

(i) $F_X^{-1}(\Theta) \sim P_X$.

(ii) $F_X(X) - \Theta \cdot P_X(X) \sim \mathcal{U}$. Specifically, if $X$ is proper then $F_X(X) \sim \mathcal{U}$.

Proof. See Appendix A. □
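
As a minimal numerical sketch of Lemma II.1 in the proper case, the snippet below shapes uniform samples into exponential ones through the inverse c.d.f. and then maps them back through the c.d.f. The exponential distribution, the sample size and the helper names are assumptions made for illustration only; they do not appear in the text.

```python
import math
import random


def exp_cdf(x, rate=1.0):
    """C.d.f. of an exponential r.v. (a proper distribution, chosen for illustration)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def exp_inv_cdf(t, rate=1.0):
    """Inverse c.d.f.; for a strictly increasing c.d.f. it is the ordinary inverse."""
    return -math.log(1.0 - t) / rate


def shape_uniform_to_x(n, seed=0):
    """Claim (i): apply the inverse c.d.f. to uniform samples."""
    rng = random.Random(seed)
    return [exp_inv_cdf(rng.random()) for _ in range(n)]


def shape_x_to_uniform(xs):
    """Claim (ii), proper case: apply the c.d.f. to samples of X."""
    return [exp_cdf(x) for x in xs]


xs = shape_uniform_to_x(20000)
us = shape_x_to_uniform(xs)
mean_x = sum(xs) / len(xs)  # empirical mean of the shaped samples (exponential mean is 1)
mean_u = sum(us) / len(us)  # empirical mean after mapping back (uniform mean is 1/2)
```

The empirical means are only a coarse sanity check; a fuller check would compare empirical c.d.f.'s.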

A proper real-valued r.v. $X$ is said to have a regular tail if there exist some $\gamma \in (0, \frac{1}{2}]$ and positive constants
$c_0, c_1, \alpha_0, \alpha_1$, such that

$$c_0 f_X^{\alpha_0}(x) \le \min(F_X(x), 1 - F_X(x)) \le c_1 f_X^{\alpha_1}(x)$$

for any $x \in \mathrm{supp}(X)$ satisfying $\min(F_X(x), 1 - F_X(x)) < \gamma$.

Lemma II.2. Let $X$ be proper with $\mathrm{supp}(X) = \mathbb{R}$ and a bounded unimodal p.d.f. $f_X$. Each of the following
conditions implies that $X$ has a regular tail:

(i) $f_X(x) = O(|x|^{-a})$ and $f_X(x) = \Omega(|x|^{-b})$ as $|x| \to \infty$, for some $b \ge a > 1$.

(ii) $f_X(x) = O(e^{-b|x|^a})$ and $f_X(x) = \Omega(e^{-b|x|^a})$ as $|x| \to \infty$, for some $a \ge 1$, $b > 0$.

Proof. See Appendix C. □

Example II.1. If $X$ is either Gaussian, Laplace or Cauchy distributed then $X$ has a regular tail.

B Information Theoretic Notions 

The relative entropy between two distributions $P_X$ and $P_Y$ is denoted by $D(P_X \| P_Y)$. The mutual information
between two r.v.'s $X$ and $Y$ is denoted $I(X;Y)$, and the differential entropy of a continuous r.v. $X$ is denoted $h(X)$.
A memoryless channel is defined via (and usually identified with) a conditional probability distribution $P_{Y|X}$ on $\mathbb{R}$.
The input alphabet $\mathcal{X}$ of the channel is the set of all $x \in \mathbb{R}$ for which the distribution $P_{Y|X}(\cdot|x)$ is defined, the
output alphabet of the channel is the set $\mathcal{Y} = \bigcup_{x \in \mathcal{X}} \mathrm{supp}(Y|X = x) \subseteq \mathbb{R}$. A sequence of real-valued r.v. pairs
$\{(X_n, Y_n)\}_{n=1}^{\infty}$ taking values in $\mathcal{X} \times \mathcal{Y}$ is said to be an input/output sequence for the memoryless channel $P_{Y|X}$ if

$$P_{Y_n|X^n Y^{n-1}}(\cdot|x^n, y^{n-1}) = P_{Y|X}(\cdot|x_n), \quad n \in \mathbb{N} \qquad (1)$$

A probability distribution $P_X$ is said to be a (memoryless) input distribution for the channel $P_{Y|X}$ if $\mathrm{supp}(X) \subseteq \mathcal{X}$.
The pair $(P_X, P_{Y|X})$ induces an output distribution $P_Y$ over the output alphabet, a joint input/output distribution
$P_{XY}$, and an inverse channel $P_{X|Y}$. Such a pair $(P_X, P_{Y|X})$ is called an input/channel pair if $I(X;Y) < \infty$.
A channel for which both the input and output alphabets $\mathcal{X}, \mathcal{Y}$ are finite sets is called a discrete memoryless
channel (DMC). Note that the numerical values of the inputs/outputs are practically irrelevant for a DMC, and hence
in this case one can assume without loss of generality that $\mathcal{X} = \{0, 1, \ldots, |\mathcal{X}| - 1\}$ and $\mathcal{Y} = \{0, 1, \ldots, |\mathcal{Y}| - 1\}$.
Moreover, two input/DMC pairs $(P_X, P_{Y|X})$ and $(P_{X^*}, P_{Y^*|X^*})$ are said to be equivalent if one can be obtained
from the other by input and output permutations, i.e., there exist permutations $\sigma_1 : \mathcal{X} \mapsto \mathcal{X}$ and $\sigma_2 : \mathcal{Y} \mapsto \mathcal{Y}$ such
that

$$P_X(i) = P_{X^*}(\sigma_1(i)), \quad P_{Y|X}(j|i) = P_{Y^*|X^*}(\sigma_2(j)|\sigma_1(i))$$

for all $i \in \mathcal{X}$, $j \in \mathcal{Y}$. In particular, equivalent pairs have the same mutual information.

Let $\Theta_0$ be a random message point uniformly distributed over the unit interval, with its binary expansion
representing an infinite independent-identically-distributed (i.i.d.) Bernoulli$(\frac{1}{2})$ sequence to be reliably conveyed by a
transmitter to a receiver over the channel $P_{Y|X}$. A transmission scheme is a sequence of a-priori agreed upon
measurable transmission functions $g_n : (0,1) \times \mathcal{Y}^{n-1} \mapsto \mathcal{X}$, so that the input to the channel generated by the transmitter
is given by

$$X_n = g_n(\Theta_0, Y^{n-1}), \quad n \in \mathbb{N}$$

A transmission scheme induces a distribution $P_{X_n|X^{n-1}Y^{n-1}}$ which together with (1) uniquely defines the joint
distribution of the input/output sequence. In the special case where $g_n$ does not depend on $y^{n-1}$, the transmission
scheme is said to work without feedback and is otherwise said to work with feedback.

A decoding rule is a sequence of measurable mappings $\{\Delta_n : \mathcal{Y}^n \mapsto \mathcal{E}\}_{n=1}^{\infty}$, where $\mathcal{E}$ is the set of all open
intervals in $(0,1)$. We refer to $\Delta_n(y^n)$ as the decoded interval. The error probability at time $n$ associated with a
transmission scheme and a decoding rule is defined as

$$p_e(n) \triangleq \mathbb{P}(\Theta_0 \notin \Delta_n(Y^n))$$

and the corresponding rate at time $n$ is defined to be

$$R_n \triangleq -\frac{1}{n} \log |\Delta_n(Y^n)|$$

We say that a transmission scheme together with a decoding rule achieve a rate $R$ over a channel $P_{Y|X}$ if

$$\lim_{n \to \infty} \mathbb{P}(R_n < R) = 0, \quad \lim_{n \to \infty} p_e(n) = 0 \qquad (2)$$

The rate is achieved within an input constraint $(\eta, u)$, if in addition

$$\lim_{n \to \infty} n^{-1} \sum_{k=1}^{n} \eta(X_k) \le u \quad \text{a.s. (element-wise)} \qquad (3)$$

where $\eta : \mathcal{X} \mapsto \mathbb{R}^m$ is a measurable function and $u \in \mathbb{R}^m$. A scheme and a decoding rule are also said to pointwise
achieve a rate $R$ if for all $\theta_0 \in (0,1)$

$$\lim_{n \to \infty} \mathbb{P}(R_n < R \mid \Theta_0 = \theta_0) = 0, \quad \lim_{n \to \infty} \mathbb{P}(\Theta_0 \notin \Delta_n(Y^n) \mid \Theta_0 = \theta_0) = 0$$

and to do the above within an input constraint $(\eta, u)$ if (3) is also satisfied. Clearly, pointwise achievability implies
achievability but not vice versa. Accordingly, a rate $R$ is called (pointwise) achievable over a channel $P_{Y|X}$ within an
input constraint $(\eta, u)$ if there exist a transmission scheme and a decoding rule (pointwise) achieving it. The capacity
(with feedback) $C(P_{Y|X}, \eta, u)$ of the channel under the input constraint is the supremum of all the corresponding
achievable rates (footnote 4). It is well known that the capacity is given by [11]

$$C(P_{Y|X}, \eta, u) = \sup_{P_X \,:\, \mathbb{E}\eta(X) \le u,\ \mathrm{supp}(X) \subseteq \mathcal{X}} I(X;Y) \qquad (4)$$

Furthermore, the capacity without feedback (i.e., considering only schemes that work without feedback) is given by
the above as well. The unconstrained capacity (i.e., when no input constraint is imposed) is denoted $C(P_{Y|X})$ for
short.

Footnote 4: A pointwise capacity can be defined as well, and may be smaller than (4) depending on the channel. However, we do not pursue this direction.
An optimal fixed rate decoding rule with rate $R$ is one that decodes an interval of length $2^{-nR}$ whose a-posteriori
probability is maximal, i.e.,

$$\Delta_n(y^n) = \arg\max_{J \in \mathcal{E} \,:\, |J| = 2^{-nR}} P_{\Theta_0|Y^n}(J|y^n)$$

where ties are broken arbitrarily. This decoding rule minimizes the error probability $p_e(n)$ for a fixed $R_n = R$. An
optimal variable rate decoding rule with a target error probability $p_e(n) = \delta_n$ is one that decodes a minimal-length
interval whose accumulated a-posteriori probability exceeds $1 - \delta_n$, i.e.,

$$\Delta_n(y^n) = \arg\min_{J \in \mathcal{E} \,:\, P_{\Theta_0|Y^n}(J|y^n) > 1 - \delta_n} |J|$$

where ties are broken arbitrarily, thereby maximizing the instantaneous rate for a given error probability. Both
decoding rules make use of the posterior distribution of the message point $P_{\Theta_0|Y^n}(\cdot|y^n)$, which can be calculated
online at both terminals.
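
The two decoding rules can be sketched on a discretized posterior. The snippet below assumes the posterior of the message point is given as masses over equal-width cells of the unit interval, so that candidate intervals are windows of adjacent cells; this grid, the function names and the example numbers are illustrative assumptions, not part of the original definitions.

```python
import math


def fixed_rate_decode(posterior, width):
    """Window of `width` adjacent cells with maximal posterior mass.
    Returns (start, end) cell indices; ties go to the leftmost window."""
    best_start, best_mass = 0, -1.0
    for start in range(len(posterior) - width + 1):
        mass = sum(posterior[start:start + width])
        if mass > best_mass:
            best_start, best_mass = start, mass
    return best_start, best_start + width


def variable_rate_decode(posterior, delta):
    """Shortest window whose posterior mass exceeds 1 - delta.
    Returns (start, end) cell indices; ties go to the leftmost window."""
    n = len(posterior)
    for width in range(1, n + 1):
        for start in range(n - width + 1):
            if sum(posterior[start:start + width]) > 1.0 - delta:
                return start, start + width
    return 0, n  # fall back to the whole unit interval


def rate(width, n_cells, n_uses):
    """R_n = -(1/n) log2 |interval|, with |interval| = width / n_cells."""
    return -math.log2(width / n_cells) / n_uses


posterior = [0.02, 0.03, 0.10, 0.50, 0.25, 0.05, 0.03, 0.02]
fixed = fixed_rate_decode(posterior, width=2)
variable = variable_rate_decode(posterior, delta=0.2)
```

The brute-force window search keeps the sketch short; prefix sums would make it linear per window width.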

It should be noted that the main reason we adopt the above nonstandard definitions for channel coding with
feedback is that they result in a much cleaner analysis. It may not be immediately clear how this corresponds to
the standard coding framework [12], and in particular, how achievability as defined above translates into the actual
reliable decoding of messages at a desired rate. The following lemma justifies this alternative formalization.

Lemma II.3. Achievability as defined in (2) and (3) above implies achievability in the standard framework.

Proof. See Appendix A. Loosely speaking, a rate $R$ is achievable in our framework if the posterior distribution
$P_{\Theta_0|Y^n}$ concentrates in an interval of size $\approx 2^{-nR}$ around $\Theta_0$, as $n$ grows large. This intuitively suggests that $nR$
bits from the message point representation could be reliably decoded, or, more accurately, that the unit interval can
be partitioned into $\approx 2^{nR}$ intervals such that the one containing $\Theta_0$ can be identified with high probability. □

C Markov Chains 

A Markov chain $\{\Psi_n\}_{n=1}^{\infty}$ over a measurable state space $\mathbb{S}$ is a stochastic process defined via an initial distribution
$P_{\Psi_1}$ on $\mathbb{S}$, and a stochastic kernel (conditional probability distribution) $\mathcal{P}$, such that

$$P_{\Psi_n|\Psi^{n-1}}(\cdot|\psi^{n-1}) = \mathcal{P}(\cdot|\psi_{n-1})$$

We say $s \in \mathbb{S}$ is the initial point of the chain if $P_{\Psi_1}(s) = 1$, and denote the probability distribution induced over
the chain for an initial point $s$ by $\mathbb{P}_s$. The Markov chain generated by sampling the original chain in steps of $m$ is
called the $m$-skeleton, and its kernel is denoted by $\mathcal{P}^m$. The chain is said to be $P_\Psi$-irreducible for a distribution $P_\Psi$
over $\mathbb{S}$, if any set $A \in \mathfrak{B}$ with $P_\Psi(A) > 0$ is reached in a finite number of steps with a positive probability for any
initial point, where $\mathfrak{B}$ is the corresponding $\sigma$-algebra over $\mathbb{S}$. $P_\Psi$ is said to be maximal for the chain if any other
irreducibility distribution is absolutely continuous w.r.t. $P_\Psi$. A maximal $P_\Psi$-irreducible chain is said to be recurrent
if for any initial point, the expected number of visits to any set $A \in \mathfrak{B}$ with $P_\Psi(A) > 0$ is infinite. The chain is
said to be Harris recurrent if any such set is visited infinitely often for any initial point. Thus, Harris recurrence
implies recurrence but not vice versa. A set $A \in \mathfrak{B}$ is called invariant if $\mathcal{P}(A|s) = 1$ for any $s \in A$. An invariant
distribution $P_\Psi$ is one for which $P_{\Psi_{n-1}} = P_\Psi$ implies $P_{\Psi_n} = P_\Psi$. Such an invariant distribution is called ergodic if
for every invariant set $A$ either $P_\Psi(A) = 0$ or $P_\Psi(A) = 1$. A chain which has (at least one) invariant distribution is
called positive. For short, we use the acronym p.h.r. to indicate positive Harris recurrence. A chain is said to have
a $d$-cycle if its state space can be partitioned into $d$ disjoint sets amongst which the chain moves cyclically a.s. The
largest $d$-cycle possible is called a period, and a chain is called aperiodic if its period equals one.

The following results are taken from [13] and [14]. We will assume here that $\mathbb{S}$ is an open/closed set of $\mathbb{R}^m$
associated with the usual Borel $\sigma$-algebra $\mathfrak{B}$, although the claims hold under more general conditions.

Lemma II.4. An irreducible chain that has an invariant distribution is (positive) recurrent, and the invariant
distribution is unique (and hence ergodic).

Lemma II.5 (p.h.r. conditions). Consider a chain with a kernel $\mathcal{P}$. Each of the following conditions implies p.h.r.:

(i) The chain has a unique invariant distribution $P_\Psi$, and $\mathcal{P}(\cdot|s) \ll P_\Psi$ for any $s \in \mathbb{S}$.

(ii) Some $m$-skeleton $\mathcal{P}^m$ is p.h.r.
Lemma II.6 (p.h.r. convergence). Consider an aperiodic p.h.r. chain with a kernel $\mathcal{P}$ and an invariant distribution
$P_\Psi$. Then for any $s \in \mathbb{S}$

$$\lim_{n \to \infty} d_{TV}\left(\mathcal{P}^n(\cdot|s), P_\Psi\right) = 0$$

Lemma II.7 (Strong law of large numbers (SLLN)). If $P_\Psi$ is an ergodic invariant distribution for the Markov chain
$\{\Psi_n\}_{n=1}^{\infty}$ with kernel $\mathcal{P}$, then for any measurable function $\eta : \mathbb{S} \mapsto \mathbb{R}$ satisfying $\mathbb{E}|\eta(\Psi)| < \infty$ and $P_\Psi$-a.a. initial
point $s \in \mathbb{S}$,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \eta(\Psi_k) = \mathbb{E}\,\eta(\Psi) \quad \mathbb{P}_s\text{-a.s.}$$

Furthermore, if the chain is p.h.r. then the above holds for any $s \in \mathbb{S}$.



D Iterated Function Systems 

Let $\mathbb{S}$ be a measurable space, $\omega : \mathbb{R} \times \mathbb{S} \mapsto \mathbb{S}$ a measurable function (footnote 5), and write $\omega_y(\cdot) = \omega(y, \cdot)$ for any $y \in \mathbb{R}$. Let
$\{Y_n\}_{n=1}^{\infty}$ be an i.i.d. sequence of real-valued r.v.'s. An Iterated Function System (IFS) $\{S_n(s)\}_{n=1}^{\infty}$ is a stochastic
process over $\mathbb{S}$, defined by (footnote 6)

$$S_1 = s, \quad S_{n+1}(s) = \omega_{Y_n} \circ \omega_{Y_{n-1}} \circ \cdots \circ \omega_{Y_1}(s) \qquad (5)$$

A Reversed IFS (RIFS) $\{\tilde{S}_n(s)\}_{n=1}^{\infty}$ is a stochastic process over $\mathbb{S}$, obtained by a reversed order composition:

$$\tilde{S}_1 = s, \quad \tilde{S}_{n+1}(s) = \omega_{Y_1} \circ \omega_{Y_2} \circ \cdots \circ \omega_{Y_n}(s) \qquad (6)$$

We say that the (R)IFS is generated by the (R)IFS kernel $\omega_y(\cdot)$, controlled by the sequence $\{Y_n\}_{n=1}^{\infty}$, and $s$ is its
initial point. Note that an IFS is a Markov chain over the state space $\mathbb{S}$, and in fact a large class of Markov chains
can be represented by a suitable IFS [15]. In contrast, an RIFS is not a Markov chain, but it is useful in the
analysis of the corresponding IFS (footnote 7); see e.g. [16][17][18]. However, in what follows the RIFS will turn out to have an
independent significance.
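
The difference between the two composition orders in (5) and (6) can be illustrated with a toy kernel. In the sketch below, the affine kernel `omega`, the control sequence and the initial point are hypothetical choices made only for illustration; the point is that the reversed composition applies the newest map innermost, so successive RIFS values settle down along a fixed control sequence.

```python
def omega(y, s):
    """Hypothetical kernel on [0, 1]: halve the distance to y, for y in {0, 1}."""
    return 0.5 * s + 0.5 * y


def ifs(s, ys):
    """Forward composition as in (5): omega_{y_1} is applied first, omega_{y_n} last."""
    for y in ys:
        s = omega(y, s)
    return s


def rifs(s, ys):
    """Reversed composition as in (6): omega_{y_n} is applied first, omega_{y_1} last."""
    for y in reversed(ys):
        s = omega(y, s)
    return s


ys = [1, 0, 1, 1, 0, 0, 1, 0]
s0 = 0.3

# Successive RIFS values differ by at most 2^-n for this kernel.
rifs_steps = [abs(rifs(s0, ys[:n + 1]) - rifs(s0, ys[:n])) for n in range(len(ys))]

# Forward IFS values keep moving by a non-vanishing amount along the same sequence.
ifs_steps = [abs(ifs(s0, ys[:n + 1]) - ifs(s0, ys[:n])) for n in range(len(ys))]
```

For a fixed control sequence, the forward composition over the reversed sequence coincides with the reversed composition, which is the pathwise counterpart of the two processes sharing the same marginal distribution.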

A function $\xi : [0,1] \mapsto [0,1]$ is called a (generally nonlinear) contraction if it is nonnegative, $\cap$-convex, and
$\xi(x) < x$ for any $x \in (0,1]$.

Lemma II.8. For any contraction $\xi(\cdot)$

$$r(n) \triangleq \sup_{x \in [0,1]} \xi^{(n)}(x), \quad \lim_{n \to \infty} r(n) = 0$$

where $\xi^{(n)}$ is the $n$-fold iteration of $\xi$. The sequence $r(n)$ is called the decay profile of $\xi$.

Proof. See Appendix A. □

Example II.2. The function $\xi(x) = rx$ is a (linear) contraction for $0 < r < 1$, with an exponential decay profile

$$r(n) = r^n$$

Example II.3. The function $\xi(x) = x - \alpha x^{\beta}$ is a contraction for $\alpha < \frac{1}{\beta}$ and $\beta > 1$, with a polynomial decay profile

$$r(n) = O\left(n^{-\frac{1}{\beta - 1}}\right)$$
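
The decay profiles in the two examples can be checked numerically. The sketch below approximates the supremum in the definition of $r(n)$ by a maximum over a uniform grid on $[0,1]$; the grid size and the parameter values $r = 0.5$, $\alpha = 0.4$, $\beta = 2$ are assumptions chosen for illustration. For $\beta = 2$ the iteration satisfies $1/x_{n+1} \ge 1/x_n + \alpha$, which gives the bound $r(n) \le 1/(1 + \alpha n)$ used as a sanity check.

```python
def decay_profile(xi, n, grid=1001):
    """Approximate r(n) = sup_x xi^(n)(x) by a maximum over a uniform grid on [0, 1]."""
    best = 0.0
    for i in range(grid):
        x = i / (grid - 1)
        for _ in range(n):
            x = xi(x)
        best = max(best, x)
    return best


r, alpha, beta = 0.5, 0.4, 2.0


def linear(x):
    """Linear contraction xi(x) = r x (illustrative r)."""
    return r * x


def polynomial(x):
    """Nonlinear contraction xi(x) = x - alpha x^beta (illustrative alpha, beta)."""
    return x - alpha * x ** beta


linear_profile = [decay_profile(linear, n) for n in (1, 5, 10)]
polynomial_profile = [decay_profile(polynomial, n) for n in (1, 10, 100)]
```

Both maps are nondecreasing on $[0,1]$ for these parameters, so the maximum is attained at the grid point $x = 1$ and the grid approximation is exact here.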

In what follows, a measurable and surjective function $\psi : \mathbb{S} \mapsto [0,1]$ is called a length function. We now state
some useful convergence lemmas for (R)IFS.

Lemma II.9. Consider the IFS defined in (5), and suppose there exist a length function $\psi(\cdot)$ and a contraction $\xi(\cdot)$
with a decay profile $r(n)$, so that

$$\mathbb{E}\left[\psi(\omega_{Y_1}(s))\right] \le \xi(\psi(s)) \quad \forall s \in \mathbb{S} \qquad (7)$$

Then for any $s \in \mathbb{S}$ and any $\varepsilon > 0$

$$\mathbb{P}\left(\psi(S_n(s)) > \varepsilon\right) \le \varepsilon^{-1} r(n)$$

Proof. See Appendix A. □

Footnote 5: $\mathbb{R}$ is equipped with the usual Borel $\sigma$-algebra, and $\mathbb{R} \times \mathbb{S}$ is equipped with the corresponding product $\sigma$-algebra.
Footnote 6: We call the process itself an IFS. In the literature sometimes $\omega_y$ is the IFS and the process is defined separately.
Footnote 7: The idea is that it is relatively simple to prove (under suitable contraction conditions) that the RIFS converges to a unique random fixed
point a.s., and since the IFS and the RIFS have the same marginal distribution, the distribution of that fixed point must be the unique stationary
distribution of the IFS.

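In the sequel, we consider an IFS over the space $\mathbb{S}_c$ of all c.d.f. functions over the open unit interval (footnote 8), i.e., all
monotone non-decreasing functions $h : (0,1) \mapsto (0,1)$ for which $\mathrm{conv}(\mathrm{range}(h)) = (0,1)$. Furthermore, we define
the following family of length functions over $\mathbb{S}_c$:

$$\psi_\lambda(h) \triangleq \int_0^1 \lambda(h(x))\,dx, \quad h \in \mathbb{S}_c \qquad (8)$$

where $\lambda : [0,1] \mapsto [0,1]$ is surjective, $\cap$-convex and symmetric about $\frac{1}{2}$.
For any $f : \mathbb{R} \mapsto \mathbb{R}$ and $s, t \in \mathbb{R}$, define

$$D_{s,t}(f) \triangleq \frac{|f(s) - f(t)|}{|s - t|}, \quad D_s(f) \triangleq \limsup_{t \to s} D_{s,t}(f) \qquad (9)$$

$D_{s,t}(\cdot)$ and $D_s(\cdot)$ are called global and local Lipschitz operators respectively.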
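Lemma II.10. Consider the RIFS in (6) over some interval $\mathbb{S} \subseteq \mathbb{R}$, and suppose the following condition holds for
some $q > 0$:

$$r \triangleq \sup_{s,t \in \mathbb{S}} \mathbb{E}\left[D_{s,t}(\omega_{Y_1})\right]^q < 1 \qquad (10)$$

Then for any $\varepsilon > 0$

$$\mathbb{P}\left(|\tilde{S}_n(s) - \tilde{S}_n(t)| > \varepsilon\right) \le \ldots$$

Proof. See Appendix A. □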
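Lemma II.11 (From [17]). Consider the RIFS in (6) over the interval $\mathbb{S} = (0,1)$. Let $\rho : (0,1) \mapsto [1, \infty)$ be a
continuous function, and define

$$J(s;t) \triangleq \sup\{\rho(\mathrm{conv}\{s,t\})\}, \quad K_s \triangleq \mathbb{E}\left[J(s; \omega_{Y_1}(s))\right], \quad \Upsilon(s,t,r) \triangleq \frac{\ldots}{1-r} + 2J(s;t)$$

If

$$r \triangleq \sup_{s} \mathbb{E}\left[\,\ldots\,\rho(s)\,\ldots\,\right] < 1,$$

then for any $s,t \in (0,1)$ and any $\varepsilon > 0$

$$\mathbb{P}\left(|\tilde{S}_n(s) - \tilde{S}_n(t)| > \varepsilon\right) \le \varepsilon^{-1}\,\Upsilon(s,t,r) \cdot r^n$$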

III Posterior Matching 

In this section, we introduce the idea of posterior matching and develop the corresponding framework. In
Subsection A, a new fundamental principle for optimal communication with feedback is presented. This principle is applied
in Subsection B, to devise a general transmission scheme suitable for any given input/channel pair $(P_X, P_{Y|X})$ (footnote 9).
This scheme will later be shown (in Section V) to achieve any rate below the corresponding mutual information
$I(X;Y)$, under general conditions. A recursive representation of the scheme in a continuous alphabet setting is
developed, where the recursion rule is given as a simple function of the input/channel pair $(P_X, P_{Y|X})$. A common
framework for discrete, continuous and mixed alphabets is introduced in Subsection C, and a corresponding unified
recursive representation is provided. Several illustrative examples are discussed throughout the section, where in
each the corresponding scheme is explicitly derived. In the special cases of the AWGN channel with a Gaussian
input, and the BSC with a uniform input, it is demonstrated how the scheme reduces to the Schalkwijk-Kailath and
Horstein schemes, respectively.

Footnote 8: $\mathbb{S}_c$ is associated with the topology of pointwise convergence, and the corresponding Borel $\sigma$-algebra.
Footnote 9: For instance, $P_X$ may be selected to be capacity achieving for $P_{Y|X}$, possibly under some desirable input constraints.
A The Basic Principle

Suppose the receiver has observed the output sequence $Y^n$, induced by a message point $\Theta_0$ and an arbitrary
transmission scheme used so far. The receiver has possibly gained some information regarding the value of $\Theta_0$ via $Y^n$,
but what is the information it is still missing? We argue that a natural candidate is any r.v. $U$ with the following
properties:

(I) $U$ is statistically independent of $Y^n$.

(II) The message point $\Theta_0$ can be a.s. uniquely recovered from $(U, Y^n)$.

Intuitively, the first requirement guarantees that $U$ represents "new information" not yet observed by the receiver,
while the second requirement makes sure this information is "relevant" in terms of describing the message point.
Following this line of thought, we suggest a simple principle for generating the next channel input:

The transmission function $g_{n+1}$ should be selected so that $X_{n+1}$ is $P_X$-distributed, and is a fixed function (footnote 10) of
some r.v. $U$ satisfying properties (I) and (II).

That way, the transmitter attempts to convey the missing information to the receiver, while at the same time
satisfying the input constraints encapsulated in $P_X$ (footnote 11). We call this the posterior matching principle for reasons that
will become clear immediately. Note that any transmission scheme adhering to the posterior matching principle
satisfies

$$I(\Theta_0; Y_{n+1}|Y^n) = I(\Theta_0, Y^n; Y_{n+1}) - I(Y_{n+1}; Y^n) = I(X_{n+1}; Y_{n+1}) - I(Y_{n+1}; Y^n) = I(X;Y) \qquad (11)$$

The second equality follows from the memorylessness of the channel and the fact that $X_{n+1}$ is a function of
$(\Theta_0, Y^n)$. The last equality holds since $X_{n+1} \sim P_X$, and since $Y_{n+1}$ is independent of $Y^n$, where the latter is
implied by property (I) together with the memorylessness of the channel. Loosely speaking, a transmission scheme
satisfying the posterior matching principle therefore conveys, on each channel use, "new information" pertaining
to the message point that is equal to the associated one-shot mutual information. This is intuitively appealing, and
gives some idea as to why such a scheme may be good. However, this property neither proves nor directly implies
anything regarding achievability. It merely indicates that we have done "information lossless" processing when
converting the one-shot channel into an $n$-shot channel, an obvious necessary condition. In fact, note we did not use
property (II), which turns out to be important (footnote 12)!

The rest of this paper is dedicated to the translation of the posterior matching principle into a viable transmission 
scheme, and to its analysis. As we shall see shortly, there are infinitely many transmission functions that satisfy the 
posterior matching principle. There is however one baseline scheme which is simple to express and analyze. 

B The Posterior Matching Scheme 

Theorem III.1 (Posterior Matching Scheme). The following transmission scheme satisfies the posterior matching
principle for any $n$:

$$g_{n+1}(\theta, y^n) = F_X^{-1} \circ F_{\Theta_0|Y^n}(\theta|y^n) \qquad (12)$$

Based on the above transmission functions, the input to the channel is a sequence of r.v.'s given by

$$X_{n+1} = F_X^{-1} \circ F_{\Theta_0|Y^n}(\Theta_0|Y^n) \qquad (13)$$

Proof. Assume $P_{\Theta_0|Y^n}(\cdot|y^n)$ is proper for any $y^n \in \mathcal{Y}^n$. Then Lemma II.1 claim (ii) implies that $F_{\Theta_0|Y^n}(\Theta_0|y^n) \sim \mathcal{U}$,
and since this holds for all $y^n$ then $F_{\Theta_0|Y^n}(\Theta_0|Y^n) \sim \mathcal{U}$ and is statistically independent of $Y^n$. It is easy to see
that for any $y^n$, the mapping $F_{\Theta_0|Y^n}(\cdot|y^n)$ is injective when its domain is restricted to $\mathrm{supp}(P_{\Theta_0|Y^n}(\cdot|y^n))$, thus
$\Theta_0$ can be a.s. uniquely recovered from $(F_{\Theta_0|Y^n}(\Theta_0|Y^n), Y^n)$. Hence, we conclude that $F_{\Theta_0|Y^n}(\Theta_0|Y^n)$ satisfies
properties (I) and (II) required by the posterior matching principle. By Lemma II.1 claim (i), applying the inverse
c.d.f. $F_X^{-1}$ merely shapes the uniform distribution into the distribution $P_X$. Therefore, $X_{n+1}$ is $P_X$-distributed
and since it is also a deterministic function of $F_{\Theta_0|Y^n}(\Theta_0|Y^n)$, the posterior matching principle is satisfied. See
Appendix A to eliminate the properness assumption. □
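
As a rough illustration of (12) in the BSC case with a uniform binary input, the sketch below tracks the posterior of the message point on a uniform grid of cells. For this input, $F_X^{-1}(t)$ equals 1 when $t \ge \frac{1}{2}$ and 0 otherwise, so the transmitted bit indicates whether the message point lies to the right of the posterior median, as in the Horstein scheme. The grid discretization, the midpoint evaluation of the posterior c.d.f., the crossover probability and the random seed are all assumptions of this sketch rather than part of the scheme as defined above.

```python
import random


def posterior_matching_bsc(theta0, p, n_uses, n_cells=1024, seed=0):
    """Simulate posterior matching over a BSC(p) with a uniform binary input,
    with the posterior of the message point tracked on a uniform grid.
    Returns the index of the cell with the largest posterior mass."""
    rng = random.Random(seed)
    post = [1.0 / n_cells] * n_cells  # uniform prior on the message point
    true_cell = min(int(theta0 * n_cells), n_cells - 1)

    for _ in range(n_uses):
        # Input assigned to each cell: F_X^{-1} of the posterior c.d.f. evaluated
        # at the cell midpoint, i.e. 1 iff the cell lies right of the median.
        bits, cdf = [], 0.0
        for w in post:
            bits.append(1 if cdf + 0.5 * w >= 0.5 else 0)
            cdf += w

        x = bits[true_cell]                     # channel input chosen by the transmitter
        y = x ^ (1 if rng.random() < p else 0)  # output of the BSC(p)

        # Bayes update, computable at both terminals thanks to the feedback.
        post = [w * ((1.0 - p) if b == y else p) for w, b in zip(post, bits)]
        total = sum(post)
        post = [w / total for w in post]

    return max(range(n_cells), key=post.__getitem__)


decoded_cell = posterior_matching_bsc(theta0=0.3141, p=0.1, n_uses=200)
```

With 1024 cells the sketch resolves only the first 10 bits of the message point; a finer grid or an exact piecewise-constant posterior would be needed to study rates.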

Footnote 10: By fixed we mean that the function cannot depend on the outputs $Y^n$, so that $X_{n+1}$ is still independent of $Y^n$.

Footnote 11: The extra degree of freedom in the form of a deterministic function is in fact significant only when $P_X$ has a discrete part, in which case a
quantization of $U$ may void property (II).

Footnote 12: One can easily come up with useless schemes for which only property (I) holds. A simple example is repetition: Transmit the binary
representation of $\Theta_0$ bit by bit over a BSC, independent of the feedback.
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Following the above, it is now easy to derive a plethora of schemes satisfying the posterior matching principle. 

Corollary III.1. Let $\{\mu_n\}_{n=1}^{\infty}$ be a sequence of u.p.f's, and let $\varsigma_n : (0,1) \mapsto (0,1)$ be a sequence of measurable bijective functions. The transmission scheme given by

$$g_{n+1}(\theta, y^n) = F_X^{-1} \circ \mu_n \circ P_{\Theta_0|Y^n}\big((0, \varsigma_n(\theta)] \,\big|\, y^n\big)$$

satisfies the posterior matching principle for any $n$. In particular, a scheme obtained by fixing $\mu_n = \mu$ and $\varsigma_n$ to be the identity function for all $n$ is called a $\mu$-variant. The transmission scheme corresponding to a $\mu$-variant is thus given by

$$g_{n+1}(\theta, y^n) = F_X^{-1} \circ \mu \circ F_{\Theta_0|Y^n}(\theta|y^n) \tag{14}$$

Finally, the baseline scheme (12) is recovered by setting $\mu$ to be the identity function.

We note that the different schemes described above have a similar flavor. Loosely speaking, the message point is described each time at a resolution determined by the current uncertainty at the receiver, by somehow stretching and redistributing the posterior probability mass so that it matches the desired input distribution (we will later see that the "stretching rate" corresponds to the mutual information). This interpretation explains the posterior matching moniker. From this point forward we mostly limit our discussion to the baseline scheme described by (12) or (13), which is henceforth called the posterior matching scheme. The $\mu$-variants (14) of the scheme will be discussed in more detail in Section VII-A.

As it turns out, the posterior matching scheme may sometimes admit a simple recursive form. 

Theorem III.2 (Recursive representation I). If $P_{XY}$ is proper, then the posterior matching scheme (12) is also given by

$$g_1(\theta) = F_X^{-1}(\theta), \qquad g_{n+1}(\theta, y^n) = \big(F_X^{-1} \circ F_{X|Y}(\cdot|y_n)\big) \circ g_n(\theta, y^{n-1}) \tag{15}$$

Moreover, the corresponding sequence of input/output pairs $\{(X_n, Y_n)\}_{n=1}^{\infty}$ constitutes a Markov chain over a state space $\mathrm{supp}(X,Y) \subset \mathbb{R}^2$, with an invariant distribution $P_{XY}$, and satisfies the recursion rule

$$X_1 = F_X^{-1}(\Theta_0), \qquad X_{n+1} = F_X^{-1} \circ F_{X|Y}(X_n|Y_n) \tag{16}$$

Proof. The initialization $g_1(\theta) = F_X^{-1}(\theta)$ results immediately from (12), recalling that $\Theta_0$ is uniform over the unit interval. To prove the recursion relation, we notice that since $P_{XY}$ is proper, the transmission functions $g_n(\theta, y^{n-1})$ are continuous when restricted to the support of the posterior, and strictly increasing in $\theta$ for any fixed $y^{n-1}$. Therefore, we have the following set of equalities:

$$\begin{aligned}
F_{\Theta_0|Y^n}(\theta|y^n) &= P(\Theta_0 \le \theta \mid Y^n = y^n) \overset{(a)}{=} P\big(g_n(\Theta_0, y^{n-1}) \le g_n(\theta, y^{n-1}) \mid Y^n = y^n\big) \\
&= P\big(X_n \le g_n(\theta, y^{n-1}) \mid Y^n = y^n\big) \overset{(b)}{=} P\big(X_n \le g_n(\theta, y^{n-1}) \mid Y_n = y_n\big) \\
&= F_{X|Y}\big(g_n(\theta, y^{n-1}) \mid y_n\big)
\end{aligned} \tag{17}$$

where in (a) we used the continuity and monotonicity of the transmission functions, and in (b) we used the facts that the channel is memoryless and that by construction $X_n$ is statistically independent of $Y^{n-1}$, which also imply that $Y^n$ is an i.i.d. sequence. The recursive rule (15) now results immediately by combining (12) and (17).

Now, using (13) we obtain

$$X_{n+1} = F_X^{-1} \circ F_{\Theta_0|Y^n}(\Theta_0|Y^n) = F_X^{-1} \circ F_{X|Y}\big(g_n(\Theta_0, Y^{n-1}) \mid Y_n\big) = F_X^{-1} \circ F_{X|Y}(X_n|Y_n)$$

yielding relation (16). Since $Y_n$ is generated from $X_n$ via a memoryless channel, the Markovity of $\{(X_n, Y_n)\}_{n=1}^{\infty}$ is established. The distribution $P_{XY}$ is invariant since by construction $(X_n, Y_n) \sim P_{XY}$ implies $X_{n+1} \sim P_X$, and then $Y_{n+1}$ is generated via the memoryless channel $P_{Y|X}$. Taking the state space to be $\mathrm{supp}(X,Y)$ is artificial here since $P_{XY}\big(\overline{\mathrm{supp}}(X,Y) \setminus \mathrm{supp}(X,Y)\big) = 0$, and is done for reasons of mathematical convenience to avoid having trivial invariant distributions (this is not true when $P_{XY}$ is not proper). Note that the chain emulates the "correct" input marginal and the "correct" joint (i.i.d.) output distribution; this interpretation is further discussed in Section VIII. □

Footnote: In fact, letting $\varsigma_n$ be any sequence of monotonically increasing functions results in the same scheme. This fact is used in the error probability analysis in Section VI to obtain tighter bounds.

In the sequel, we refer to the function $F_X^{-1} \circ F_{X|Y}$ appearing in the recursive representation as the posterior matching kernel. Let us now turn to consider several examples, which are frequently revisited throughout the paper.

Example III.1 (AWGN channel). Let $P_{Y|X}$ be an AWGN channel with noise variance $N$, and let us set a Gaussian input distribution $X \sim \mathcal{N}(0, P)$, which is capacity achieving for an input power constraint $P$. We now derive the posterior matching scheme in this case, and show it reduces to the Schalkwijk-Kailath scheme. Let $\mathrm{SNR} = \frac{P}{N}$. Standard manipulations yield the following posterior distribution

$$X \mid Y = y \;\sim\; \mathcal{N}\left(\frac{\mathrm{SNR}}{1+\mathrm{SNR}} \cdot y,\; \frac{P}{1+\mathrm{SNR}}\right) \tag{18}$$

The joint p.d.f. $f_{XY}$ is Gaussian and hence proper, so the recursive representation of Theorem III.2 is valid. By definition, the corresponding posterior matching kernel satisfies

$$F_X^{-1} \circ F_{X|Y}(x|y) = \{z : F_X(z) = F_{X|Y}(x|y)\} \tag{19}$$

However, from Gaussianity and (18) we know that

$$F_{X|Y}(x|y) = F_X\left(\sqrt{1+\mathrm{SNR}}\left(x - \frac{\mathrm{SNR}}{1+\mathrm{SNR}} \cdot y\right)\right) \tag{20}$$

Combining (19) and (20), the posterior matching kernel for the AWGN channel setting is given by

$$F_X^{-1} \circ F_{X|Y}(x|y) = \sqrt{1+\mathrm{SNR}}\left(x - \frac{\mathrm{SNR}}{1+\mathrm{SNR}} \cdot y\right) \tag{21}$$

and hence the posterior matching scheme is given by

$$X_1 = F_X^{-1}(\Theta_0), \qquad X_{n+1} = \sqrt{1+\mathrm{SNR}}\left(X_n - \frac{\mathrm{SNR}}{1+\mathrm{SNR}} \cdot Y_n\right) \tag{22}$$

From the above we see that at time $n+1$, the transmitter sends the error term pertaining to the MMSE estimate of $X_n$ from $Y_n$, scaled to match the permissible input power $P$. In fact, it can be verified either directly or using the equivalence stated in Theorem III.2 that $X_{n+1}$ is the scaled MMSE term of $X_n$ given the entire output sequence $Y^n$. Therefore, the posterior matching scheme in this case is an infinite-horizon, variable-rate variant of the Schalkwijk-Kailath scheme. This variant is in fact even somewhat simpler than the original scheme [8], since the initial matching step of the random message point makes transmission start at a steady-state. The fundamental difference between the posterior matching principle and the Schalkwijk-Kailath "parameter estimation" approach in a non-Gaussian setting is now evident. According to Schalkwijk-Kailath one should transmit a scaled linear MMSE term given past observations, which is uncorrelated with these observations but not independent of them as dictated by the posterior matching principle; the two notions thus coincide only in the AWGN case. In fact, it can be shown that following the Schalkwijk-Kailath approach when the additive noise is not Gaussian results in achieving only the corresponding "Gaussian equivalent" capacity, see Example VII.2.
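The recursion (22) is simple enough to simulate directly. The following is a minimal Python sketch, not part of the original development: the function name `awgn_posterior_matching` and the numerical values used below are illustrative assumptions. It iterates the kernel (21) along one sample path, so that one can check empirically that the input power stays near $P$ and that $X_{n+1}$ is uncorrelated with $Y_n$.

```python
import math
import random
from statistics import NormalDist

def awgn_posterior_matching(theta0, P, N, n, rng):
    """Iterate recursion (22); return inputs X_1..X_n and outputs Y_1..Y_n.

    theta0 is the message point in (0, 1); P and N are the input power and
    the noise variance. The function name and interface are hypothetical.
    """
    snr = P / N
    gain = math.sqrt(1.0 + snr)       # rescales the estimation error back to power P
    mmse_coeff = snr / (1.0 + snr)    # E[X | Y = y] = mmse_coeff * y, see (18)
    x = NormalDist(0.0, math.sqrt(P)).inv_cdf(theta0)   # X_1 = F_X^{-1}(Theta_0)
    xs, ys = [], []
    for _ in range(n):
        y = x + rng.gauss(0.0, math.sqrt(N))   # memoryless AWGN channel use
        xs.append(x)
        ys.append(y)
        x = gain * (x - mmse_coeff * y)        # posterior matching kernel (21)
    return xs, ys
```

For instance, with $P = 3$ and $N = 1$, averaging `x * x` over a long path should give a value close to 3, and averaging `xs[k + 1] * ys[k]` should give a value close to 0, in line with the steady-state behavior described above.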

Example III.2 (BSC). Let $P_{Y|X}$ be a BSC with crossover probability $p$, and set a capacity achieving input distribution $X \sim \mathrm{Bernoulli}(\tfrac{1}{2})$, i.e., $f_X(x) = \tfrac{1}{2}\big(\delta(x) + \delta(x-1)\big)$. We now derive the posterior matching scheme for this setting, and show it reduces to the Horstein scheme. The conditions of Theorem III.2 are not satisfied since the input distribution is discrete, and we therefore use the original non-recursive representation (12) for now. It is easy to see that the matching step $F_X^{-1}$ acts as a quantizer above/below $\tfrac{1}{2}$, and so we get

$$X_{n+1} = F_X^{-1} \circ F_{\Theta_0|Y^n}(\Theta_0|Y^n) = \begin{cases} 0 & \Theta_0 < \mathrm{median}\{f_{\Theta_0|Y^n}(\cdot|Y^n)\} \\ 1 & \text{o.w.} \end{cases}$$

which is precisely the Horstein scheme. The posterior matching principle is evident in this case, since slicing the posterior distribution at its median results in an input $X_{n+1} \sim \mathrm{Bernoulli}(\tfrac{1}{2})$ given any possible output $Y^n = y^n$, and is hence independent of $Y^n$ and $\mathrm{Bernoulli}(\tfrac{1}{2})$-distributed. We return to the BSC example later in this section, after we develop the necessary tools to provide an alternative (and more useful) recursive representation for the Horstein scheme.

Example III.3 (Uniform Input/Noise). Let $P_{Y|X}$ be an additive noise channel with noise uniformly distributed over the unit interval, and set the input $X \sim \mathcal{U}$, i.e., uniform over the unit interval as well. Let us derive the posterior matching scheme in this case. It is easy to verify that the inverse channel's p.d.f. is given by

$$f_{X|Y}(x|y) = \begin{cases} y^{-1} \cdot 1_{(0,y)}(x) & y \in (0,1] \\ (2-y)^{-1} \cdot 1_{(y-1,1)}(x) & y \in (1,2) \end{cases}$$

Since the conditions of Theorem III.2 are satisfied, we can use the recursive representation. We note that since the input distribution is $\mathcal{U}$, the matching step is trivial and the posterior matching kernel is given by

$$F_X^{-1} \circ F_{X|Y}(x|y) = F_{X|Y}(x|y) = \begin{cases} \dfrac{x}{y} \cdot 1_{(0,y)}(x) + 1_{[y,\infty)}(x) & y \in (0,1] \\[2mm] \dfrac{x-y+1}{2-y} \cdot 1_{(y-1,1)}(x) + 1_{[1,\infty)}(x) & y \in (1,2) \end{cases} \tag{23}$$

and therefore the posterior matching scheme is given by

$$X_1 = \Theta_0, \qquad X_{n+1} = \frac{X_n}{Y_n} \cdot 1_{(0,1]}(Y_n) + \frac{X_n - Y_n + 1}{2 - Y_n} \cdot 1_{(1,2)}(Y_n) \tag{24}$$

The above has in fact a very simple interpretation. The desired input distribution is uniform, so we start by transmitting the message point $X_1 = \Theta_0$. Then, given $Y_1$ we determine the range of inputs that could have generated this output value, and find an affine transformation that stretches this range to fill the entire unit interval. Applying this transformation to $X_1$ generates $X_2$. We now determine the range of possible inputs given $Y_2$, and apply the corresponding affine transformation to $X_2$, and so on. This is intuitively appealing since what we do in each iteration is just zoom in on the remaining uncertainty region for $\Theta_0$. Since the posterior distribution is always uniform, this zooming-in is linear.

The posterior distribution induced by this transmission strategy is uniform in an ever shrinking sequence of intervals. Therefore, a zero-error variable-rate decoding rule would be to simply decode at time $n$ the (random) maximal interval $J_n$ within which the posterior is uniform. The size of that interval is

$$|J_n| = \prod_{k \in J} Y_k \prod_{k \notin J} (2 - Y_k)$$

where $J = \{k : 1 \le k \le n,\ Y_k \le 1\}$. Denoting the channel noise sequence by $Z_n \sim P_Z$, the corresponding rate is

$$\begin{aligned}
R_n = -\frac{1}{n}\log|J_n| &= \frac{1}{n}\sum_{k \in J}\log\frac{1}{Y_k} + \frac{1}{n}\sum_{k \notin J}\log\frac{1}{2-Y_k} = \frac{1}{n}\sum_{k=1}^{n}\log\frac{f_{X|Y}(X_k|Y_k)}{f_X(X_k)} \\
&= \frac{1}{n}\sum_{k=1}^{n}\log\frac{f_Z(Z_k)}{f_Y(Y_k)} \;\xrightarrow[n\to\infty]{}\; \mathbb{E}\log f_Z(Z) - \mathbb{E}\log f_Y(Y) = I(X;Y) = \tfrac{1}{2}\log e \quad \text{a.s.}
\end{aligned}$$

where we have used the SLLN for the i.i.d. sequences $Z^n, Y^n$. Therefore, in this simple case we were able to directly show that the posterior matching scheme, in conjunction with a simple variable rate decoding rule, achieves the mutual information with zero error probability. In the sequel, the achievability of the mutual information and the tradeoff between rate, error probability and transmission period obtained by the posterior matching scheme are derived for a general setting. We then revisit this example and provide the same results as above from this more general viewpoint.
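The zooming-in interpretation and the rate calculation above can be illustrated numerically. The following Python sketch is an illustration only, with hypothetical function names (`run_uniform_scheme`, `posterior_interval`, `empirical_rate_nats`). It runs recursion (24), reconstructs the receiver's interval $J_n$ from the outputs alone by composing the affine maps, and evaluates $R_n$ in nats. Floating point arithmetic limits the interval reconstruction to a few dozen channel uses.

```python
import math
import random

def run_uniform_scheme(theta0, n, rng):
    """Iterate recursion (24); return the channel outputs Y_1..Y_n."""
    x, ys = theta0, []                 # X_1 = Theta_0
    for _ in range(n):
        y = x + rng.random()           # additive noise, uniform on the unit interval
        ys.append(y)
        x = x / y if y <= 1.0 else (x - y + 1.0) / (2.0 - y)
    return ys

def posterior_interval(ys):
    """Receiver side: interval J_n of message points consistent with the outputs.

    After n uses the next input equals a * Theta_0 + b, and the posterior of
    Theta_0 is uniform over the set where that affine map lands in (0, 1).
    """
    a, b = 1.0, 0.0
    for y in ys:
        if y <= 1.0:
            a, b = a / y, b / y
        else:
            a, b = a / (2.0 - y), (b - y + 1.0) / (2.0 - y)
    return -b / a, (1.0 - b) / a

def empirical_rate_nats(ys):
    """R_n = -(1/n) log |J_n|, computed from the outputs as in the text."""
    return -sum(math.log(y if y <= 1.0 else 2.0 - y) for y in ys) / len(ys)
```

For a short run the true message point should lie inside the returned interval, and for a long run `empirical_rate_nats(ys)` should be close to $0.5$ nats, i.e. $\tfrac{1}{2}\log_2 e \approx 0.72$ bits per channel use.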

Example III.4 (Exponential Input/Noise). Consider an additive noise channel $P_{Y|X}$ with $\mathrm{Exponential}(1)$ noise, and set the input $X \sim \mathrm{Exponential}(1)$ as well. This selection is not claimed to be capacity achieving under any reasonable input constraints, yet it is instructive to study due to the simplicity of the resulting scheme. We will return to the exponential noise channel in Example III.7 after developing the necessary tools, and analyze it using the capacity achieving distribution under an input mean constraint.

It is easy to verify that for the above simple selection, the input given the output is uniformly distributed, i.e., the inverse channel p.d.f./c.d.f. are given by

$$f_{X|Y}(x|y) = \frac{1}{y} \cdot 1_{(0,y)}(x), \qquad F_{X|Y}(x|y) = \frac{x}{y} \cdot 1_{(0,y)}(x) + 1_{[y,\infty)}(x)$$



The input's inverse c.d.f. is given by

$$F_X^{-1}(s) = \ln\frac{1}{1-s}$$

Therefore, the posterior matching kernel is given by

$$F_X^{-1} \circ F_{X|Y}(x|y) = \ln\frac{y}{y-x} \tag{25}$$

and the posterior matching scheme in this case is simply given by

$$X_1 = \ln\frac{1}{1-\Theta_0}, \qquad X_{n+1} = \ln\frac{Y_n}{Y_n - X_n} \tag{26}$$

[Figure 1: The normalized channel $P_{\Phi|\Theta}$. Block diagram: $F_X^{-1}$, followed by $P_{Y|X}$, followed by $F_Y$.]



C The Normalized Channel 

The recursive representation provided in Theorem III.2 is inapplicable in many interesting cases, including DMCs in particular. In order to treat discrete, continuous and mixed alphabet inputs/channels within a common framework, we define for any input/channel pair $(P_X, P_{Y|X})$ a corresponding normalized channel $P_{\Phi|\Theta}$ with $(0,1)$ as a common input/output alphabet, and a uniform input distribution $\Theta \sim \mathcal{U}$. The normalized channel is obtained by viewing the matching operator $F_X^{-1}(\cdot)$ as part of the original channel, and applying the output c.d.f. operator $F_Y(\cdot)$ to the channel's output, with the technical exception that whenever $F_Y(\cdot)$ has a jump discontinuity the output is randomly selected uniformly over the jump span. This is depicted in Figure 1, where $F_Y$ stands for the aforementioned possibly random mapping. This construction is most simply formalized by

$$P_{Y|\Theta}(\cdot|\theta) = P_{Y|X}\big(\cdot \,\big|\, F_X^{-1}(\theta)\big), \qquad \Phi = F_Y(Y) - P_Y(Y) \cdot \Lambda \tag{27}$$

where $\Theta \sim \mathcal{U}$, and $\Lambda \sim \mathcal{U}$ is statistically independent of $(\Theta, Y)$.

Lemma III.1 (Normalized Channel Properties). Let $(P_\Theta, P_{\Phi|\Theta})$ be the normalized input/channel pair corresponding to the pair $(P_X, P_{Y|X})$. The following properties are satisfied:

(i) $\Phi \sim \mathcal{U}$, i.e., $P_{\Phi|\Theta}$ preserves the uniform distribution over the unit interval.

(ii) The mutual information is preserved, i.e.,

$$I(\Theta;\Phi) = I(X;Y)$$

(iii) The joint distribution $P_{\Theta\Phi}$ is proper.

(iv) The normalized kernel $F_{\Theta|\Phi}(\theta|\phi)$ is continuous in $\theta$ for $P_\Phi$-a.a. $\phi \in (0,1)$.

Proof. (i) By Lemma II.1 claim (i) we have $F_X^{-1}(\Theta) \sim P_X$, and so $Y \sim P_Y$ in (27). The result now follows from Lemma II.1 claim (ii).

(ii) An easy exercise using the relations in Figure 1 and noting that $X, Y$ are always uniquely recoverable from $\Theta, \Phi$ respectively.

(iii) See Appendix A.

(iv) Follows easily from (iii).

□

Footnote: The output mapping is of a lesser importance, and is introduced mainly to provide a common framework.

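The posterior matching scheme over the normalized channel with a uniform input is given by

$$g_{n+1}(\theta, \phi^n) = F_{\Theta_0|\Phi^n}(\theta|\phi^n), \qquad \Theta_{n+1} = F_{\Theta_0|\Phi^n}(\Theta_0|\Phi^n)$$

The properties of the normalized channel allow for a unified recursive representation of the above scheme via the inverse normalized channel $P_{\Theta|\Phi}$ corresponding to the pair $(P_\Theta, P_{\Phi|\Theta})$, i.e., in terms of the normalized posterior matching kernel $F_{\Theta|\Phi}$.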

Theorem III.3 (Recursive representation II). The posterior matching scheme for the normalized channel is given by the recursive relation:

$$g_1(\theta) = \theta, \qquad g_{n+1}(\theta, \phi^n) = F_{\Theta|\Phi}(\cdot|\phi_n) \circ g_n(\theta, \phi^{n-1}) \tag{28}$$

The corresponding sequence of input/output pairs $\{(\Theta_n, \Phi_n)\}_{n=1}^{\infty}$ constitutes a Markov chain over a state space $\mathrm{supp}(\Theta,\Phi) \subset (0,1)^2$, with an invariant distribution $P_{\Theta\Phi}$, and satisfies the recursion rule

$$\Theta_1 = \Theta_0, \qquad \Theta_{n+1} = F_{\Theta|\Phi}(\Theta_n|\Phi_n) \tag{29}$$

Furthermore, (29) is equivalent to the posterior matching scheme (13) in the sense that the distribution of the sequence $\{(F_X^{-1}(\Theta_n), F_Y^{-1}(\Phi_n))\}_{n=1}^{\infty}$ coincides with the distribution of the sequence $\{(X_n, Y_n)\}_{n=1}^{\infty}$.

Proof. By Lemma III.1 the joint distribution $P_{\Theta\Phi}$ is proper, hence Theorem III.2 is applicable and the recursive representations and Markovity follow immediately. Once again, taking the state space to be $\mathrm{supp}(\Theta,\Phi)$ and not $\overline{\mathrm{supp}}(\Theta,\Phi)$ is artificial and is done for reasons of mathematical convenience, to avoid having the trivial invariant distributions $P_0 \times P_{\Phi|\Theta}(\cdot|0)$ and $P_1 \times P_{\Phi|\Theta}(\cdot|1)$, where $P_0(0) = 1$, $P_1(1) = 1$. The distribution $P_{\Theta\Phi}$ is invariant by construction, and the equivalence to the original scheme is by definition. □

In the sequel, an initial point for the aforementioned Markov chain will be given by a fixed value $\theta_0 \in (0,1)$ of the message point only. Notice also that the Theorem above reveals an interesting fact: whenever $F_X^{-1}$ is not injective, the sequence of input/output pairs pertaining to the original posterior matching scheme (13) is a hidden Markov process. In particular, this is true for the BSC and the Horstein scheme.

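Example III.2 (BSC, continued). The normalized channel's p.d.f. corresponding to a BSC with crossover probability $p$ and a $\mathrm{Bernoulli}(\tfrac{1}{2})$ input distribution is given by $f_{\Phi|\Theta}(\phi|\theta) = 2(1-p)$ when $\theta, \phi$ are either both smaller or both larger than $\tfrac{1}{2}$, and $f_{\Phi|\Theta}(\phi|\theta) = 2p$ otherwise. Following Theorem III.3 and simple manipulations, the corresponding normalized posterior matching kernel is given by

$$F_{\Theta|\Phi}(\theta|\phi) = \begin{cases}
2(1-p)\theta & \theta \in (0,\tfrac{1}{2}),\ \phi \in (0,\tfrac{1}{2}) \\
2p\theta + (1-2p) & \theta \in [\tfrac{1}{2},1),\ \phi \in (0,\tfrac{1}{2}) \\
2p\theta & \theta \in (0,\tfrac{1}{2}),\ \phi \in [\tfrac{1}{2},1) \\
2(1-p)\theta - (1-2p) & \theta \in [\tfrac{1}{2},1),\ \phi \in [\tfrac{1}{2},1)
\end{cases} \tag{30}$$

and for a fixed $\phi$ is supported on two functions of $\theta$, depending on whether $\phi \gtrless \tfrac{1}{2}$, which corresponds to $y = 0, 1$ in the original discrete setting, see Figure 2. Therefore, the posterior matching scheme (which is equivalent to the Horstein scheme in this case) is given by the following recursive representation:

$$\Theta_1 = \Theta_0, \qquad \Theta_{n+1} = \begin{cases}
2(1-p)\Theta_n & \Theta_n \in (0,\tfrac{1}{2}),\ \Phi_n \in (0,\tfrac{1}{2}) \\
2p\Theta_n + (1-2p) & \Theta_n \in [\tfrac{1}{2},1),\ \Phi_n \in (0,\tfrac{1}{2}) \\
2p\Theta_n & \Theta_n \in (0,\tfrac{1}{2}),\ \Phi_n \in [\tfrac{1}{2},1) \\
2(1-p)\Theta_n - (1-2p) & \Theta_n \in [\tfrac{1}{2},1),\ \Phi_n \in [\tfrac{1}{2},1)
\end{cases}$$

The hidden Markov process describing the original Horstein scheme is recovered from the above by setting

$$X_k = F_X^{-1}(\Theta_k) = 1_{[\frac{1}{2},1)}(\Theta_k), \qquad Y_k = F_Y^{-1}(\Phi_k) = 1_{[\frac{1}{2},1)}(\Phi_k)$$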



Footnote: This is an abuse of notation, since an initial point is properly given by a pair $(\theta_1, \phi_1)$. However, it can be justified since $\Theta_1 = \Theta_0$ and $\Phi_1$ is generated via a memoryless channel. Hence, any statement that holds for a.a./all initial points $(\theta_1, \phi_1)$ also holds in particular for a.a./all $\theta_0$.

Example III.5 (The binary erasure channel (BEC)). The binary erasure channel is defined over the input alphabet $\mathcal{X} = \{0, 1\}$ and the output alphabet $\mathcal{Y} = \{0, 1, 2\}$. Given any input, the output is equal to that input with probability $1-p$, and equal to 2 (an erasure) with probability $p$. Using the capacity achieving distribution $P_X = \mathrm{Bernoulli}(\tfrac{1}{2})$, it is easy to see from the non-recursive representation (12) that the posterior matching scheme in this case is exactly the simple repetition rule: transmit the first bit of $\Theta_0$ until it is correctly received, then continue to the next bit, and so on. This scheme clearly achieves the capacity $1-p$. The recursive representation w.r.t. the normalized channel is very simple and intuitive here as well. The normalized posterior matching kernel is supported on three functions: the identity function corresponding to the erasure output 2, and the functions $2\theta$, $2\theta - 1$ that correspond to the outputs 0, 1 respectively.

Example III.6 (General DMC). The case where $P_{Y|X}$ is a DMC and $P_X$ is a corresponding discrete input distribution is a simple extension of the BSC/BEC settings. The normalized posterior matching kernel is supported over a finite number of $|\mathcal{Y}|$ continuous functions, which are all quasi-affine relative to a fixed partition of the unit interval into subintervals corresponding to the input distribution. Precisely, for any $x \in \mathcal{X}$ the normalized posterior matching kernel evaluated at $\theta = F_X(x)$ is given by

$$F_{\Theta|\Phi}\big(F_X(x) \,\big|\, \phi\big) = F_{X|Y}\big(x \,\big|\, F_Y^{-1}(\phi)\big) \tag{31}$$

and by a linear interpolation in between these points. Hence, the corresponding kernel slopes are given by $\frac{P_{X|Y}(x|F_Y^{-1}(\phi))}{P_X(x)}$.

Example III.7 (Exponential noise, input mean constraint). Consider an additive noise channel $P_{Y|X}$ with $\mathrm{Exponential}(b)$ noise, but now instead of arbitrarily assuming an exponential input distribution as in Example III.4, let us impose an input mean constraint $(x, a)$, i.e.,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} X_k \le a \quad \text{a.s.}$$

The capacity achieving distribution under this input constraint was determined in [19] to be a mixture of a deterministic distribution and an exponential distribution, with the following generalized p.d.f.:

$$f_X(x) = \frac{b}{a+b}\,\delta(x) + \frac{a}{(a+b)^2}\exp\left(-\frac{x}{a+b}\right)$$

Under this input distribution the output is $Y \sim \mathrm{Exponential}(a+b)$, and the capacity can be expressed in closed form

$$C = I(X;Y) = \log\left(1 + \frac{a}{b}\right)$$

in a remarkable resemblance to the AWGN channel with an input power constraint. Interestingly, in this case the posterior matching scheme can also be written in closed form, and as stated later, also achieves the channel capacity under the input mean constraint.

To derive the scheme, we must resort to the normalized representation since the input distribution is not proper. The input's inverse c.d.f. and the output's c.d.f. are given by

$$F_X^{-1}(\theta) = (a+b)\ln\left(\frac{a}{(a+b)(1-\theta)}\right) \cdot 1_{(\frac{b}{a+b},1)}(\theta), \qquad F_Y(y) = 1 - \exp\left(-\frac{y}{a+b}\right)$$

Using the normalized representation and practicing some algebra, we find that the normalized posterior matching kernel is given by

$$F_{\Theta|\Phi}(\theta|\phi) = (1-\phi)^{\frac{a}{b}} \cdot \frac{a+b}{b}\,\theta \cdot 1_{(0,\frac{b}{a+b})}(\theta) + \left(\frac{a}{a+b} \cdot \frac{1-\phi}{1-\theta}\right)^{\frac{a}{b}} \cdot 1_{[\frac{b}{a+b},\,1-\frac{a}{a+b}(1-\phi))}(\theta) + 1_{[1-\frac{a}{a+b}(1-\phi),\,1)}(\theta) \tag{32}$$

Thus the posterior matching scheme in this case is given by

$$\Theta_1 = \Theta_0, \qquad \Theta_{n+1} = \begin{cases} \dfrac{a+b}{b} \cdot \Theta_n \cdot (1-\Phi_n)^{\frac{a}{b}} & \Theta_n < \dfrac{b}{a+b} \\[3mm] \left(\dfrac{a}{a+b} \cdot \dfrac{1-\Phi_n}{1-\Theta_n}\right)^{\frac{a}{b}} & \text{o.w.} \end{cases} \tag{33}$$

where the original channel's input/output pairs are given by

$$X_n = (a+b)\ln\left(\frac{a}{(a+b)(1-\Theta_n)}\right) \cdot 1_{(\frac{b}{a+b},1)}(\Theta_n), \qquad Y_n = (a+b)\ln\frac{1}{1-\Phi_n}$$

and constitute a hidden Markov process. Note that since we have $\Theta_n \in \big(0,\, 1 - \frac{a}{a+b}(1-\Phi_n)\big)$ a.s., then $\Theta_{n+1} \in (0,1)$ a.s. and we need not worry about the rest of the thresholds appearing in (32).

IV Regularity Conditions for Input/Channel Pairs 

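In Section V we prove the optimality of the posterior matching scheme. However, to that end we first need to introduce several regularity conditions, and define some well behaved families of input/channel pairs.

For any fixed $\phi \in (0,1)$, define $\theta_\phi^-$ and $\theta_\phi^+$ to be the unique solutions of

respectively. For any $\varepsilon > 0$, define the left-$\varepsilon$-measure ${}^{-}\!P^{\varepsilon}_{\Phi|\Theta}(\cdot|\theta)$ of $P_{\Phi|\Theta}(\cdot|\theta)$ to have a density ${}^{-}\!f^{\varepsilon}_{\Phi|\Theta}$ given by

$${}^{-}\!f^{\varepsilon}_{\Phi|\Theta}(\phi|\theta) = \inf_{\xi \in J_\varepsilon^-(\phi,\theta)} f_{\Phi|\Theta}(\phi|\xi),$$

where the interval $J_\varepsilon^-(\phi,\theta)$ is defined to be

$$J_\varepsilon^-(\phi,\theta) = \big(\max(\theta_\phi^-, \theta - \varepsilon),\ \theta\big) \tag{34}$$

Note that the left-$\varepsilon$-measure is not a probability distribution since in general ${}^{-}\!P^{\varepsilon}_{\Phi|\Theta}((0,1)|\theta) < 1$. Similarly define the right-$\varepsilon$-measure ${}^{+}\!P^{\varepsilon}_{\Phi|\Theta}(\cdot|\theta)$ of $P_{\Phi|\Theta}(\cdot|\theta)$ to have a density ${}^{+}\!f^{\varepsilon}_{\Phi|\Theta}$ given by

$${}^{+}\!f^{\varepsilon}_{\Phi|\Theta}(\phi|\theta) = \inf_{\xi \in J_\varepsilon^+(\phi,\theta)} f_{\Phi|\Theta}(\phi|\xi),$$

where the interval $J_\varepsilon^+(\phi,\theta)$ is defined to be

$$J_\varepsilon^+(\phi,\theta) = \big(\theta,\ \min(\theta + \varepsilon, \theta_\phi^+)\big)$$

Note that $\lim_{\varepsilon\to 0} {}^{-}\!f^{\varepsilon}_{\Phi|\Theta} = \lim_{\varepsilon\to 0} {}^{+}\!f^{\varepsilon}_{\Phi|\Theta} = f_{\Phi|\Theta}$ a.e. over $(0,1)^2$. Following these definitions, an input/channel pair $(P_X, P_{Y|X})$ is said to be regular, if the corresponding normalized channel satisfies

$$\inf_{\varepsilon > 0}\Big[ D\big(P_{\Phi|\Theta} \,\big\|\, {}^{-}\!P^{\varepsilon}_{\Phi|\Theta} \,\big|\, P_\Theta\big) + D\big(P_{\Phi|\Theta} \,\big\|\, {}^{+}\!P^{\varepsilon}_{\Phi|\Theta} \,\big|\, P_\Theta\big) \Big] < \infty$$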

Loosely speaking, the regularity property guarantees that the sensitivity of the channel law $P_{Y|X}$ to input perturbations is not too high, or is at least attenuated by a proper selection of the input distribution $P_X$. Regularity is satisfied in many interesting cases, as demonstrated in the following Lemma.

Lemma IV.1. Each of the following conditions implies that the input/channel pair $(P_X, P_{Y|X})$ is regular:

(i) $h(\Theta,\Phi)$ is finite, $\mathrm{supp}(\Theta,\Phi)$ is convex in the $\theta$-direction, and $f_{\Theta\Phi}$ is bounded away from zero over $\mathrm{supp}(\Theta,\Phi)$.

(ii) $P_{XY}$ is proper, $\mathrm{supp}(X,Y)$ is convex in the $x$-direction, $f_X$ is bounded, and $f_{X|Y}$ has a uniformly bounded max-to-min ratio, i.e.,

$$\sup_{y \in \mathrm{supp}(Y)} \left( \frac{\sup_{x \in \mathrm{supp}(X|Y=y)} f_{X|Y}(x|y)}{\inf_{x \in \mathrm{supp}(X|Y=y)} f_{X|Y}(x|y)} \right) < \infty$$

(iii) $P_{XY}$ is proper, and $f_{X|Y}(x|y)$ is unimodal with a regular tail and a bounded variance, uniformly over $y \in \mathrm{supp}(Y)$.

(iv) $P_{Y|X}$ is a DMC with nonzero transition probabilities.

Proof. See Appendix C. □

For an input/channel pair $(P_X, P_{Y|X})$, define the following set of properties:

(A1) $(P_X, P_{Y|X})$ is regular.

(A2) The invariant distribution $P_{\Theta\Phi}$ for the Markov chain $\{(\Theta_n, \Phi_n)\}_{n=1}^{\infty}$ is ergodic.

(A3) $P_{\Theta|\Phi}$ is fixed-point free, i.e., for any $\theta \in (0,1)$,

$$P\big(F_{\Theta|\Phi}(\theta|\Phi) = \theta\big) < 1. \tag{35}$$

(A4) $P_X$ achieves the unconstrained capacity over $P_{Y|X}$, i.e., $I(X;Y) = C(P_{Y|X})$.

The following is easily observed.

Lemma IV.2. (A2) $\Rightarrow$ (A3).

Proof. See proof of Lemma VII.1. □

Let $\Omega_A$ be the family of all input/channel pairs satisfying properties (A1) and (A2). Let $\Omega_B$ be the family of all input/channel pairs satisfying properties (A1), (A3) and (A4). In the sequel, we show that for members in $\Omega_A \cup \Omega_B$ the corresponding posterior matching scheme achieves the mutual information. However, while Lemma IV.1 provides means to verify the regularity Property (A1), and Properties (A3) and (A4) are easy to check, the ergodicity property (A2) may be difficult to verify in general. Therefore, we introduce the following more tractable property:

(A5) $f_{XY}$ is bounded and continuous over $\mathrm{supp}(X,Y)$, where the latter is connected and convex in the $y$-direction.

We now show that (A3) and (A5) together imply a stronger version of (A2). In fact, to that end a weaker version of (A3) is sufficient, which we state (for convenience) in terms of the non-normalized kernel:

(A3*) For any $x \in \mathrm{supp}(X)$ there exists $y \in \mathrm{supp}(Y)$, such that $F_X^{-1} \circ F_{X|Y}(x|y) \neq x$.

Lemma IV.3. (A3*) $\wedge$ (A5) $\Rightarrow$ $\{(\Theta_n, \Phi_n)\}_{n=1}^{\infty}$ is p.h.r. and aperiodic $\Rightarrow$ (A2).

Proof. For the first implication, see Appendix B. The second implication is immediate since p.h.r. implies in particular a unique invariant distribution, which is hence ergodic. □

Following that, let us define $\Omega_C$ to be the family of all input/channel pairs $(P_X, P_{Y|X})$ satisfying properties (A1), (A3) and (A5).

Corollary IV.1. $\Omega_C \subset \Omega_A$.


Footnote: Since an input/channel pair has finite mutual information, (A4) implies that $C(P_{Y|X}) < \infty$. The unconstrained capacity is finite for discrete input and/or output channels, but can be finite under other input alphabet constraints (e.g., an amplitude constraint).

Turning to the discrete case, let $(P_X, P_{Y|X})$ be an input/DMC pair. Without loss of generality, we will assume throughout that $\min_{x \in \mathcal{X}} P_X(x) > 0$, as otherwise the unused input can be removed. Define the following set of properties:

(B1) $\min_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{Y|X}(y|x) > 0$.

(B2) At least one of the following holds:

(i) There exists some $y \in \mathcal{Y}$ with $P_Y(y) > 0$, such that either $P_X \prec_d P_{X|Y}(\cdot|y)$ or $P_{X|Y}(\cdot|y) \prec_d P_X$.

(ii) There exist some $y_0, y_1 \in \mathcal{Y}$ with $P_Y(y_0) > 0$, $P_Y(y_1) > 0$, such that $P_{X|Y}(\cdot|y_0) \prec_d P_{X|Y}(\cdot|y_1)$.

(B3) Vx e X, 3yo, yi e y s.t. > |^ ^ Q, wher^B 
Lemma IV.4. Let $(P_X, P_{Y|X})$ be an input/DMC pair. Then:

(i) (B1) $\Rightarrow$ (A1).

(ii) (B2) $\Rightarrow$ (A3).

(iii) (A3) $\wedge$ (B1) $\wedge$ (B3) $\Rightarrow$ (A2).

(iv) $|\mathcal{X}| = 2$ $\Rightarrow$ (B2).

(v) (B1) $\wedge$ $I(X;Y) > 0$ $\Rightarrow$ there exists an equivalent pair $(P_{X^*}, P_{Y^*|X^*})$ satisfying (B1) $\wedge$ (A3).

(vi) For any $\varepsilon > 0$ there exists $P'_X$ such that $d_{TV}(P_X, P'_X) < \varepsilon$ and $(P'_X, P_{Y|X})$ is an input/DMC pair satisfying (B3).

Proof. Claim (i) follows immediately from condition (iv) of Lemma IV.1. Claim (iv) holds since any two nonidentical binary distributions can be ordered by dominance. For the remaining claims, see Appendix A. □

Remark IV.1. The equivalent pair in Lemma IV.4 claim (v) is obtained via an input permutation only, which is given explicitly in the proof and can be simply computed.

V Achieving the Mutual Information 

Our main theorem is presented in Subsection A, establishing the achievability of the mutual information via posterior matching for a large family of input/channel pairs. The examples of Section III are then revisited, and the applicability of the theorem is verified in each. Subsection B is dedicated to the proof of the Theorem.

A Main Result

Theorem V.1 (Achievability). Consider an input/channel pair $(P_X, P_{Y|X}) \in \Omega_A \cup \Omega_B$ (resp. $\Omega_C$). The corresponding posterior matching scheme with a fixed/variable rate optimal decoding rule achieves (resp. pointwise achieves) any rate $R < I(X;Y)$ over the channel $P_{Y|X}$. Furthermore, if $(P_X, P_{Y|X}) \in \Omega_A$ (resp. $\Omega_C$), then $R$ is achieved (resp. pointwise achieved) within an input constraint $(\eta, \mathbb{E}\eta(X))$, for any measurable $\eta : \mathcal{X} \mapsto \mathbb{R}$ satisfying $\mathbb{E}|\eta(X)| < \infty$.

Example III.1 (AWGN, continued). $P_{XY}$ is proper (jointly Gaussian), and the inverse channel's p.d.f. $f_{X|Y}(x|y)$ is Gaussian with a variance independent of $y$, hence by Lemma II.2 has a regular tail uniformly in $y$. Therefore, by condition (iii) of Lemma IV.1 the Gaussian input/AWGN channel pair $(P_X, P_{Y|X})$ is regular and Property (A1) is satisfied. It is easy to see that the linear posterior matching kernel (21) is fixed-point free, and so Property (A3*) is satisfied as well. Finally, $f_{XY}$ is continuous and bounded over its support $\mathbb{R}^2$, so Property (A5) is also satisfied. Therefore $(P_X, P_{Y|X}) \in \Omega_C$, and Theorem V.1 verifies the well known fact that the Schalkwijk-Kailath scheme (pointwise) achieves any rate below the capacity $I(X;Y) = \tfrac{1}{2}\log(1 + \mathrm{SNR})$.

Footnote: $\mathbb{Q}$ is the set of rational numbers. Note that there always exists a pair for which the quotient is negative, but the quotient is not necessarily irrational.

Example III.2 (BSC, continued). The pair of a $\mathrm{Bernoulli}(\tfrac{1}{2})$ input $P_X$ and a BSC $P_{Y|X}$ with any nontrivial crossover probability $p \neq 0, 1$ satisfies properties (A4) and (B1). Properties (A1) and (A3) follow from claims (i), (ii) and (iv) of Lemma IV.4. Hence $(P_X, P_{Y|X}) \in \Omega_B$ and Theorem V.1 implies that the posterior matching scheme, which coincides in this case with the Horstein scheme, indeed achieves the capacity $I(X;Y) = 1 - h_b(p)$. This settles in the affirmative a longstanding conjecture.

Remark V.1. In the BSC Example above, it also holds (via Lemma IV.4 claim (iii)) that $(P_X, P_{Y|X}) \in \Omega_A$ for a.a. crossover probabilities $p$, except perhaps for the countable set $S$ of crossover probabilities for which property (B3) is not satisfied. In these cases the ergodicity property (A2) is not guaranteed, though this may be an artifact of the proof (see Remark A.1). Therefore, although capacity is achieved for any $p$ (via $\Omega_B$), Theorem V.1 guarantees the empirical distribution of the input sequence $X^n$ to approach $P_X$ only for $p \notin S$. However, since $P_X$ is the unique capacity achieving distribution, this sample-path property of the input sequence holds for $p \in S$ nonetheless (see Remark V.3).

Remark V.2. Interestingly, for $p \in S$ the Horstein medians exhibit "regular behavior", meaning that any median
point can always be returned to in a fixed number of steps. In fact, for the subset of S where i^"iog°i!^p) — for 
some positive integer $k > 2$, the Horstein scheme can be interpreted as a simple finite-state constrained encoder that precludes subsequences of more than $k$ consecutive 0's or 1's, together with an insertion mechanism repeating any erroneously received bit $k+1$ times. This fact has been identified and utilized in earlier work to prove achievability in this special case.

Example III.3 (Uniform input/noise, continued). $P_{XY}$ is proper with a bounded p.d.f. over the convex support $\mathrm{supp}(X,Y) = (0,1) \times (0,2)$, the marginal p.d.f. $f_X$ is bounded, and the inverse channel's p.d.f. is uniform hence has a bounded max-to-min ratio. Therefore, condition (ii) of Lemma IV.1 holds, and properties (A1) and (A5) are satisfied. It is readily verified that the kernel (23) is fixed-point free, and so property (A3*) is satisfied as well. Therefore $(P_X, P_{Y|X}) \in \Omega_C$, and Theorem V.1 reverifies that the simple posterior matching scheme (24) pointwise achieves the mutual information $I(X;Y) = \tfrac{1}{2}\log e$, as previously established by direct calculation. In fact, we have already seen that (variable-rate) zero-error decoding is possible in this case, and in the next section we arrive at the same conclusion from a different angle.

Example III.4 (Exponential input/noise, continued). $P_{XY}$ is proper with a bounded p.d.f. over the convex support $\mathrm{supp}(X,Y) = \mathbb{R}^+ \times \mathbb{R}^+$, the marginal p.d.f. $f_X$ is bounded, and the inverse channel's p.d.f. is uniform hence has a bounded max-to-min ratio. Therefore, condition (ii) of Lemma IV.1 holds, and properties (A1) and (A5) are satisfied. It is readily verified that the kernel (25) is fixed-point free, and so property (A3*) is satisfied as well. Therefore $(P_X, P_{Y|X}) \in \Omega_C$, and so by Theorem V.1 the posterior matching scheme (26) pointwise achieves the mutual information, which in this case is $I(X;Y) \approx 0.8327$.
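The numerical value quoted above can be cross-checked. For $X, Z \sim \mathrm{Exponential}(1)$ independent, the output $Y = X + Z$ is Gamma distributed with shape 2, so $I(X;Y) = h(Y) - h(Z) = (1+\gamma) - 1 = \gamma$ nats, where $\gamma$ is the Euler-Mascheroni constant; in bits this is $\gamma / \ln 2 \approx 0.8327$. The Python sketch below (hypothetical function names, an illustration rather than part of the original analysis) evaluates this closed form and also estimates the same quantity empirically along a sample path of the scheme (26).

```python
import math
import random

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

def exact_mutual_information_bits():
    """I(X;Y) = h(Y) - h(Z) = gamma nats for unit-mean exponential input and noise."""
    return EULER_GAMMA / math.log(2.0)

def empirical_rate_bits(theta0, n, rng):
    """Average of log f_{X|Y}(X_k|Y_k) / f_X(X_k) along the path of scheme (26)."""
    x = math.log(1.0 / (1.0 - theta0))     # X_1 = F_X^{-1}(Theta_0)
    total = 0.0
    for _ in range(n):
        y = x + rng.expovariate(1.0)       # additive Exponential(1) noise
        total += x - math.log(y)           # log((1 / y) / exp(-x)), in nats
        x = math.log(y / (y - x))          # posterior matching kernel (25)
    return total / n / math.log(2.0)
```

For a long path, `empirical_rate_bits` should be close to `exact_mutual_information_bits()`, in line with the pointwise achievability statement of the theorem.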

Example IIII.6I (General DMC, continued). It has already been demonstrated that the posterior matching scheme 
achieves the capacity of the BSC. We now show that the same holds true for a general DMC, up to some minor 
resolvable technicalities. Let Py\x be a DMC with nonzero transition probabilities, and set Px to be capacity 
achieving (unconstrained). Hence properties (B[TJ and are satisfied, and by Lemma llV4l claim (Qi, property 
(A1) holds as well. The corresponding posterior matching scheme in this case is equivalent to a generalized Horstein
scheme, which was conjectured to achieve the unconstrained capacity when there are no fixed points, namely when
property (A3) is satisfied [21, Section 4.6]. Since in this case $(P_X, P_{Y|X}) \in \Omega_B$, Theorem IV.1 verifies that this
conjecture indeed holds. Moreover, the restriction of not having fixed points is in fact superfluous, since by Lemma
IV.4 there always exists an equivalent input/DMC pair (obtained simply by an input permutation) for
which the posterior matching scheme is capacity achieving. This scheme can be easily translated into an equivalent
optimal scheme for the original channel $P_{Y|X}$, which is in fact one of the many $\mu$-variants satisfying the posterior
matching principle mentioned in Corollary III.1, where the u.p.f. $\mu$ plays the role of the input permutation. This
observation is further discussed and generalized in Section VII-A.

More generally, let $P_X$ be any input distribution for $P_{Y|X}$, e.g., capacity achieving under some input constraints.
If the associated kernel is fixed-point free ((A3) holds) and (B3) is satisfied, then by Lemma IV.4 we
have that (A2) holds as well. This implies $(P_X, P_{Y|X}) \in \Omega_A$, and hence by Theorem IV.1 the associated posterior
matching scheme achieves rates up to the corresponding mutual information $I(X;Y)$, within any input constraints
encapsulated in $P_X$. Again, the fixed-point requirement is superfluous, and achievability within the same input
constraints can be guaranteed via a posterior matching scheme for an equivalent channel (or the corresponding
$\mu$-variant), for which the kernel is fixed-point free.

It is worth noting that requiring property (B3) to hold is practically nonrestrictive. For any fixed alphabet sizes
$|\mathcal{X}|, |\mathcal{Y}|$, there is only a countable number of input/channel pairs that fail to satisfy this property. Moreover, even if
$(P_X, P_{Y|X})$ does not satisfy (B3), then by Lemma IV.4 we can find an input distribution $P_X^*$ arbitrarily
close (in total variation) to $P_X$, such that (B3) does hold for $(P_X^*, P_{Y|X})$. Hence, the posterior matching scheme
(or a suitable variant, if there are fixed points) for $(P_X^*, P_{Y|X})$ achieves rates arbitrarily close to $I(X;Y)$ while
maintaining any input constraint encapsulated in $P_X$ arbitrarily well.

Remark V.3. For input/DMC pairs such that $(P_X, P_{Y|X}) \in \Omega_B$ but where (B3) does not hold, ergodicity is not
guaranteed (see also Remark A.1). Therefore, although the (unconstrained) capacity is achieved, the empirical
distribution of the input sequence $X^n$ will not necessarily approach $P_X$, unless $P_X$ is the unique capacity achieving
distribution for $P_{Y|X}$ (see Remark V.5).

Remark V.4. The restriction to nonzero DMC transition probabilities (B1) is mainly intended to guarantee that the
regularity property (A1) is satisfied (although this property holds under somewhat more general conditions, e.g., for
the BEC). However, regularity can be defined in a less restrictive fashion so that this restriction could be removed.
Roughly speaking, this can be done by redefining the left-$\varepsilon$-measure and right-$\varepsilon$-measure of Section IV so that the
neighborhoods over which the infimum is taken shrink near some finite collection of points in (0, 1), and not only
near the endpoints, thereby allowing "holes" in the conditional densities. For simplicity of exposition, this extension
was left out.

Example III.7 (Exponential noise with an input mean constraint, continued). This example is not immediately
covered by the lemmas developed. However, studying the input/channel pair $(P_{Y|\Theta}, P_\Theta)$ (namely, the normalized
pair but without the artificial output transformation), we see that $P_{\Theta Y}$ satisfies property (A5), and the corresponding
posterior matching kernel (which is easily derived from (32)) is fixed-point free, hence property (A3*) is also
satisfied. Proving that this is a regular pair is straightforward but requires some work. Loosely speaking, it stems
from the fact that $f_{Y|\Theta}(y|\theta)$ is monotonically decreasing in $y$ for any fixed $\theta$, and has a one-sided regular tail.
Therefore, the posterior matching scheme (33) pointwise achieves any rate below the mean-constrained capacity



Let us start by providing a rough outline of the proof. First, we show that zero rate is achievable, i.e., any fixed
interval around the message point accumulates a posterior probability mass that tends to one. This is done by noting
that the time evolution of the posterior c.d.f. $F_{\Theta_0|\Phi^n}$ can be represented by an IFS over the c.d.f. space $\mathfrak{F}_c$, generated by the
inverse channel's c.d.f. via function composition, and controlled by the channel outputs. Showing that the inverse
channel's c.d.f. is contractive on the average (Lemma V.1), we conclude that the posterior c.d.f. tends to a unit
step function about $\Theta_0$ (Lemma V.2), which verifies zero-rate achievability. For positive rates, we use the SLLN for
Markov chains to show that the posterior p.d.f. at the message point is $\approx 2^{nI(X;Y)}$ (Lemma V.3). Loosely speaking,
a point that cannot be distinguished from $\Theta_0$ must induce, from the receiver's perspective, about the same input
sequence as does the true message point. Since the normalized inputs are just the posterior c.d.f. sequence evaluated
at the message point, this means that such points will also have about the same c.d.f. sequence as $\Theta_0$ does, hence
also will have a posterior p.d.f. $\approx 2^{nI(X;Y)}$. But that is only possible within an interval no larger than $\approx 2^{-nI(X;Y)}$
around $\Theta_0$, since the posterior p.d.f. integrates to unity. Thus, points that cannot be distinguished from $\Theta_0$ must
be $2^{-nI(X;Y)}$ close to it. This is more of a converse, but essentially the same ideas can be applied (Lemma V.4) to
show that for any $R < I(X;Y)$, a $2^{-nR}$ neighborhood of the message point accumulates (with high probability) a
posterior probability mass exceeding some fixed $\varepsilon > 0$ at some point during the first $n$ channel uses. This essentially
reduces the problem to the zero-rate setting, which was already solved.
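The concentration of the posterior around the message point can be illustrated informally by simulation. The sketch below is not the construction used in the proof: it assumes a BSC with a Bernoulli(1/2) input (the Horstein special case), approximates the posterior by a piecewise-constant density on a finite grid, and assigns each grid cell the input its center would induce. The function name, grid size, seed, and parameter values are all hypothetical choices.

```python
import random

def simulate_bsc_posterior_matching(theta0, crossover, uses, cells=1024, seed=0):
    """Grid approximation of the receiver's posterior over the message point."""
    rng = random.Random(seed)
    mass = [1.0 / cells] * cells                      # uniform prior on (0, 1)
    true_cell = min(int(theta0 * cells), cells - 1)   # cell holding the message point
    for _ in range(uses):
        # Input each cell would induce: 1 if the posterior c.d.f. at the
        # cell center exceeds 1/2, else 0.
        inputs = []
        below = 0.0
        for m in mass:
            inputs.append(1 if below + 0.5 * m > 0.5 else 0)
            below += m
        x = inputs[true_cell]                              # transmitted bit
        y = x ^ (1 if rng.random() < crossover else 0)     # BSC output
        # Bayes update: cells whose induced input matches y gain likelihood.
        mass = [m * ((1.0 - crossover) if xi == y else crossover)
                for m, xi in zip(mass, inputs)]
        total = sum(mass)
        mass = [m / total for m in mass]
    return mass, true_cell

mass, true_cell = simulate_bsc_posterior_matching(theta0=0.3141, crossover=0.1, uses=300)
radius = 5  # neighborhood of 11 cells around the message point
neighbourhood_mass = sum(mass[true_cell - radius: true_cell + radius + 1])
print(neighbourhood_mass)
```

With these settings the effective rate is far below capacity, so the mass in a small fixed neighborhood of the message point is expected to be close to one, in line with the zero-rate step of the outline.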
We begin by establishing the required technical Lemmas. 

Lemma V.l. Let {Px, Py\x) satisfy property (^40. Then there exist a contraction ^(•) and a length function V-'a(') 
as in dS]) over ^c, such that for any h G 



/(X;y) = log(l + f). 



B Proof of Theorem IV.1




(36) 
Proof. See Appendix A. □
Define the stochastic process $\{G_n\}_{n=1}^{\infty}$,

$$G_n(\cdot) \triangleq g_n(\cdot, \Phi^{n-1})$$

Since $g_n(\theta, \phi^{n-1}) = F_{\Theta_0|\Phi^{n-1}}(\theta|\phi^{n-1})$, $G_n$ is the posterior c.d.f. of the message point after observing the i.i.d.
output sequence $\Phi^{n-1}$, and is a r.v. taking values in the c.d.f. space $\mathfrak{F}_c$. Moreover, by (28) we have that

$$G_{n+1} = F_{\Theta|\Phi}(\cdot|\Phi_n) \circ G_n \qquad (37)$$

and therefore $\{G_n\}_{n=1}^{\infty}$ is an IFS over $\mathfrak{F}_c$ generated by the normalized posterior matching kernel $F_{\Theta|\Phi}(\cdot|\phi)$ (via
function composition) and controlled by the outputs $\{\Phi_n\}_{n=1}^{\infty}$. Since the message point is uniform, the IFS initializes
at $G_1(\theta) = \theta\,\mathbb{1}_{(0,1)}(\theta) + \mathbb{1}_{[1,\infty)}(\theta)$ (the uniform c.d.f.). Recall that the normalized kernel is continuous in $\theta$ for $P_\Phi$-a.a.
$\phi$ (Lemma III.1, claim (iv)), hence $G_n$ is a.s. continuous.

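We find it convenient to define the $\delta$-positive trajectory ${}^{+}\Theta_k^{\delta}$ and $\delta$-negative trajectory ${}^{-}\Theta_k^{\delta}$ as follows:

$${}^{+}\Theta_k^{\delta} \triangleq G_k(\Theta_0 + \Delta^{+}), \qquad \Delta^{+} = \min\left(\delta, \frac{1-\Theta_0}{2}\right)$$

$${}^{-}\Theta_k^{\delta} \triangleq G_k(\Theta_0 - \Delta^{-}), \qquad \Delta^{-} = \min\left(\delta, \frac{\Theta_0}{2}\right) \qquad (38)$$

These trajectories are essentially the posterior c.d.f. evaluated after $k$ steps at a $\delta$ perturbation from $\Theta_0$ (up to edge
issues), or alternatively the induced normalized input sequence for such a perturbation from the point of view of the
receiver. The true normalized input sequence, which corresponds to the c.d.f. evaluated at the message point itself,
is $\Theta_k = G_k(\Theta_0)$.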

The next Lemma shows that for a zero rate, the trajectories diverge towards the boundaries of the unit interval 
with probability approaching one, hence our scheme has a vanishing error probability in this special case. 

Lemma V.2. Let {Px,Py\x) ^^ttisfy property (^3^. Then for any e > 0,S > 0, 

F{-ei >s) = (y^^) , F{+ei <l-e) = [i/Hn)) 
where r(n) is the decay profile of the contraction ^l-) from Lemma [VJ\ 



Proof. Let tp^ and be the length function and contraction from Lemma [VTI corresponding to the pair {Px , Py\x)^ 
and let r{n) be the decay profile of ^. By the contraction property (|36] l and Lemma |lL9l we immediately have that 
for any > 

P(Va(G„)>^) <^"'K") (39) 

Define the (random) median point of G„: 

e:^illf|^?e(o,l):G„W>^ 

Since G„ is a.s. continuous, Gri(8* ) = | is a.s. satisfied. Using the symmetry of the function A(-), we can write 

^.{Gn) ^ T" x{Gn{e))de + [ x{i ~ Gn{e))de a.s. (40) 



and then: 

P (G„(e: ~S)>iy) < P (A (G„(e: - S)) >iy)<F{ I A (G„(e)) dO > i^S \ < P(V'^(G„) > j^S) 



e*,-<5 



where (a) holds since X{6) > 6 for any 9 G (0, ^), in (b) we use the monotonicity of G„, and (c) follows from ( |40l i. 
Using ( [39] l this leads to 

F{G^{Q:^-S)>i^) <—r{n) (41) 
and similarly

P (G„ {Q;^ + S)<l-iy)<— r{n) (42) 

Now set any 77 G (0, |), and write 

°GniO)de + {I - Gn{0)) d0 > i)j 

(a) /r /-e: ri 
< p 



G„(0)d0 + (1 - G„((?)) rf^ > ^1 u {le; - eol > 

(b) / - l/X / zy\('^)2 / I/' 

< P [MGn) > 2 ) + - ®ol > 2) - ^ '■^''^ + ^ V'®" - > 2 
(d) 2 



(c) 2 
< 



-r{n) + p ({G„(eo) > G„(e; + ^)} u {G„(eo) < G„(e: - ^)} 
^ r(,i) + p (G„(eo) ^ (?7, 1 - + p (G„(e; - ^) > + p (G„(e: + ^) < 1 - ,y) 



(f) 2 4 

< - r{n) + 2ri-\ r{n) (43) 

V vrj 

where in (a) we use the fact that integrals differ only over the interval between 9o and 0* and the integrands are 
bounded by unity, in (b) we use the union bound, and then (|39] l by noting that applying A(-) can only increase the 
integrands, in (c) we use ( [39l ) and (d) holds by the continuity and monotonicity of G„. These properties are applied 
again together with the union bound in (e), and the inequality holds for any 7; g (0, |). Finally in (f) we use ( |4T1 - I42b 
the fact that G„(0o) = Qn is uniformly distributed over the unit interval. Choosing 77 = \/r(n) we get 

eo 



P / Gn{0)d9+ {l-Gn{9))de > v\<cu-^^/Hn) 



for c > 0. The same bound clearly holds separately for each of the two integrals above. Define the set 

^ |^? G (0, 1) : P ^ " Gn{e)dd >v\Qo = e^> cv-^ ^K^) | 

Then 

P " Gn{B)de > = EP ^y " G,,{B)dB >v\Q^> P(eo S H^) • cv'^ ^/Hnj 

and we get P(0o G H^) < •y/r(n). Let us now set i^n ~ \/ r{n), and suppose n is large enough so that f„ < 
Recalling the definition of the negative trajectory ^Ofj, we have 

P ("9^ > e) = EP (-6* > e I 60) < ^ P ' Gn{e)de > | • min{5, 9} Qq = 0^ d9 

" i ^ (/ ° '^"(^^'^^ > min{7.„, y} I ©0 = 0^ d9 
< P(eo e n^j + / d9+ cv-^ i/Hn)d9 

Jo J2iy„e-i 

The result for P {+0^ < 1 — e) is proved via the exact same arguments. □ 
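Lemma V.3. Let $(P_X, P_{Y|X})$ satisfy property (A2). Then the posterior p.d.f. evaluated at the message point satisfies

$$\lim_{n\to\infty} \frac{1}{n}\log f_{\Theta_0|\Phi^n}(\Theta_0|\Phi^n) = I(X;Y) \quad \text{a.s.} \qquad (44)$$

Proof. Since the p.d.f.'s involved are all proper, we can use Bayes' law to obtain the following recursion rule:

$$f_{\Theta_0|\Phi^n}(\theta|\phi^n) = \frac{f_{\Theta_0|\Phi^{n-1}}(\theta|\phi^{n-1})\, f_{\Phi_n|\Theta_0,\Phi^{n-1}}(\phi_n|\theta,\phi^{n-1})}{f_{\Phi_n|\Phi^{n-1}}(\phi_n|\phi^{n-1})} = f_{\Theta_0|\Phi^{n-1}}(\theta|\phi^{n-1})\, f_{\Phi|\Theta}\big(\phi_n \,\big|\, g_n(\theta,\phi^{n-1})\big) \qquad (45)$$

where in the second equality we have used the memoryless channel property and the fact that the output sequence
is an i.i.d. sequence with marginal $\mathcal{U}$. Applying the recursion rule $n$ times, taking a logarithm and evaluating at
the message point, we obtain

$$\frac{1}{n}\log f_{\Theta_0|\Phi^n}(\Theta_0|\Phi^n) = \frac{1}{n}\sum_{k=1}^{n}\log f_{\Phi|\Theta}\big(\Phi_k \,\big|\, g_k(\Theta_0,\Phi^{k-1})\big) = \frac{1}{n}\sum_{k=1}^{n}\log f_{\Phi|\Theta}(\Phi_k|\Theta_k)$$

Now by property (A2) the invariant distribution $P_{\Theta\Phi}$ is ergodic, and so we can use the SLLN for Markov chains
(Lemma II.7), which asserts in this case that for $P_\Theta$-a.a. $\theta_0 \in (0,1)$

$$\lim_{n\to\infty} \frac{1}{n}\log f_{\Theta_0|\Phi^n}(\theta_0|\Phi^n) = E\big[\log f_{\Phi|\Theta}(\Phi|\Theta)\big] = I(\Theta;\Phi) = I(X;Y) \quad \text{a.s.}$$

Since $\Theta_0 \sim P_\Theta$, (44) is established. □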
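The growth rate asserted by the lemma can be checked informally for a concrete channel. The sketch below assumes a BSC with a Bernoulli(1/2) input, for which each channel use multiplies the posterior density at the message point by $P(y|x)/P(y)$, that is, by $2(1-p)$ when the bit gets through and by $2p$ when it is flipped. The function names, seed, and parameter values are hypothetical.

```python
import math
import random

def empirical_log_density_rate(crossover, uses, seed=0):
    """Average per-use growth (in bits) of the posterior density at the message point."""
    rng = random.Random(seed)
    total_bits = 0.0
    for _ in range(uses):
        flipped = rng.random() < crossover
        likelihood = crossover if flipped else 1.0 - crossover
        # Output marginal is uniform on {0, 1}, so P(y) = 1/2.
        total_bits += math.log2(likelihood / 0.5)
    return total_bits / uses

def bsc_capacity(crossover):
    """1 - h(p), the mutual information of a BSC with uniform input, in bits."""
    entropy = (-crossover * math.log2(crossover)
               - (1.0 - crossover) * math.log2(1.0 - crossover))
    return 1.0 - entropy

rate = empirical_log_density_rate(crossover=0.1, uses=50_000)
capacity = bsc_capacity(0.1)
print(rate, capacity)
```

The empirical average is a plain law-of-large-numbers estimate here, since for this channel the per-use factors are i.i.d.; the general case in the lemma needs the SLLN for Markov chains.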

For short, let us now define the $(n,R)$-positive trajectory ${}^{+}\Theta_k^{n,R}$ and the $(n,R)$-negative trajectory ${}^{-}\Theta_k^{n,R}$ as
the corresponding trajectories in (38) with $\delta = 2^{-nR}$. Accordingly, we also write $\Delta^{+}_{n,R}, \Delta^{-}_{n,R}$ in lieu of $\Delta^{+}, \Delta^{-}$
respectively. The following lemma uses the SLLN to demonstrate how, for rates lower than the mutual information,
these two trajectories eventually move away from some small and essentially fixed neighborhood of the input, with
probability approaching one. This is achieved by essentially proving a more subtle version of Lemma V.3, showing
that it roughly holds in the vicinity of the message point.

Lemma V.4. Let [Px, Py\x) satisfy properties and(J^. Then for any rate R < I{X; Y) there exists e > 
small enough such that 

^Too ^ ( n {^©^'^ - < (^^ ) }) = (46) 

Proof We prove the first assertion of ( |46] |. the second assertion follows through essentially the same way. Let S > 
be such that R < I{X;Y) — 6. Let ^P^^q{-\9) be the left-e-measure corresponding to P$|q(-|6'), as defined in 
Section |IV] Define: 

4- 4Elog-/||e($|e)= / / /*|e('/'l^)log7lie('/'l^)'^^# (47) 



supp(G,$) 

We have that 

< I{X; Y) - I- = 7(6; $) - 7," = D(F*|e || -P||e I ^e) 

and since by property (A[T]i the input/channel is regular then inf£>o 73(P$|e || ^P||q | Pe) < 00. hence for any e 
small enough 

-00 < J- < I{X;Y) 

We have therefore established that the function /$|e log ~/||q is finitely integrable for any e > small enough, 
and converges to /$|e log /$|e a.e in a monotonically nondecreasing fashion, as e — )■ 0. Applying Levi's monotone 
convergence Theorem |20l, we can exchange the order of the limit and the integration to obtain 

lim 77 = I(X; Y) 

Let us set e hereinafter so that 

I- -e>I{X;Y)-^ 
Since $I_\varepsilon^{-}$ is finite we can once again apply (using property (A2)) the SLLN for Markov chains (Lemma II.7) to
obtain

$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}\log {}^{-}f^{\varepsilon}_{\Phi|\Theta}(\Phi_k|\Theta_k) = I_\varepsilon^{-} \quad \text{a.s.} \qquad (48)$$

The above intuitive relation roughly means that if the receiver, when considering the likelihood of a wrong message
point, obtains an induced input sequence which is always $\varepsilon$-close to the true input sequence, then the posterior p.d.f.
at this wrong message point will be close to that of the true message point given in Lemma V.3.
Define the following two sequences of events:
Define the following two sequences of events: 



fc=i fe=i 



(49) 



where the neighborhood J is defined in ( [34] i. Let us now show that lim P(iJ„ e) = 0. This fact will then be 
shown to imply lim ¥{En.e) = 0, which is precisely the first assertion in ( |46] |. Define the following sequence of 



events: 



r„., ^ {Go > 2-"^} n I i 5^ log 7l|e(*fc|0^) > nx; F) - ^ I 

Using ( |48] l and the fact that the message point is uniform over the unit interval, it is immediately clear that P(r„.e) — > 
1. For short, define the random interval Jn.R = (Bo — ®o)' consider the following chain of inequaUties: 



> logE (e„ ~ -e;:'^) ^ logE (G„(eo) - G„(eo - A;^)) > logE A;^ • inf (/eo|$"-i (0 I 
> log (^E (^A- j, ■ ^ mf ^/e„|*n-i {9 \ | n r„,,^ • P(£;„,, n T„,e)) 



(a) 



> E(log A,7 ^ I E„.e n Tn.e) + E log inf rr /$|e($fc| Gfe(0)) E„^, n T,,,^ + log F{E„^, n T„,e) 

I \ 

> -ni? - 1 + E V log inf /<i,|e($fc| ^^n.s n T„,, + logP(£;„,e n T^,,) 

> -7ii?-i + E( ^iog7||e($fc|efc)|£;„,,nT„ J +iogP(£;„,, nT„,e) 

> -nR - 1 + Y) - -) + \ogV{E„^, D T,,^,) 

In (a) we use Jensen's inequality and the expansion of the posterior p. d.f. given in ( |45] |. in (b) we use the definition 
and monotonicity of Gk, (c) holds due to En^e and (d) due to T„ e. Therefore, 

p(£;„,, n r„,,) < 2-«(-f(^;'>')-f-«)-i < 2-"5-i _^ o (50) 

where the last inequality holds since R < I{X; Y) — S. Now, since P(r„ 1, then for any 77 > we have 
P(T„ e) > 1 — 77 for n large enough. Using that and dSOl l we bound P(£'„.e) simply as follows: 

^{En,e) < nEn,e H T„.,) + (1 - P(r„,e)) < 2-"^-^ + 7? < 27^ (51) 

where the last two inequalities are true for 77, large enough. Since (ISTT i holds for any 77 > 0, we conclude that 
F{En^e) — >■ 0, as desired. 

To finalize the proof, note that En.e implies that for any 1 < fc < 77 — 1 

and the rightmost inequality implies 



The above constraints imply that for any 1 < k < n — 2 

establishing the impUcation En.e i?„-i,e. Consequently, P(i?„^e) < F{En-i.e), thus lim ¥{En,e) =0. □ 



We are now finally in a position to prove Theorem IV.1 for the family $\Omega_A$. Loosely speaking, we build on the
simple fact that since the chain is stationary by construction, one can imagine transmission to have started at any
time $m$ with a message point $\Theta_m$ replacing $\Theta_0$. With some abuse of notation, we define

-e|(e„) ^ Fe|$(-|$™+;.-i) o . . . o ^^e|*(-|$m+i) ° Feii (q^ - min (^e, | 

Namely, the e-negative trajectory when starting at time m from 9„i- Note that in particular we have ^0|(9o) = 
Since the chain is stationary, the distribution of ^0|(6„i) is independent of m. The corresponding positive 
trajectory +6|(0m) can be defined in the same manner 

Now recall the event En^e defined in (|49] l, which by Lemma |V41 satisfies P(£'„_e) — > for any e > small 
enough. Note that the complementary event E'f^ ^ implies that at some time m < n, the (n, i?)-negative trajectory 
is below the e-neighborhood of 9„i, namely "O^^ < &m — min(e, for some to. Using the mono- 
tonicity of the transmission functions, this in turn implies that the (n, i?)-negative trajectory at time n lies below 
the corresponding e-negative trajectory starting from 8„i, namely ~0"'^ < ~8^_„(8„i). Thus we conclude that 
P ("85^"^ > max (^8^_„(8m))) ^ for any fixed e > small enough, where the maximum is taken over 
1 < TO < n. 

Fixing any a > 0, we now show that "8"]^^^-)^^ in probability; 
IP("0a+„)„ >S)<r (^-8-« > ^nmx^-e^,_„(e™)) +P ({-8;'it)« > n <^nmx^-8^„_„(e„)}) 
< 0(1) +P (^max^-8[i+„)„_„(8„) > s\ < o{l) + ^ P (-e^i+„)„_,„(e™) > 

n 

0(1) + ^ P (-efi+„)„_„(8o) > S) 0(1) + 6-^0{n^/^) (52) 



In (a) we used the union bound, in (b) the fact that the chain is stationary, and Lemma [V2l was invoked in (c) where 
we recall that (AO by Lemma llV.21 Therefore, if r{n) ~ o{n~^) then 8"^^^^^^ ^ in probability 

is established. However, for the more general statement we note that this mild constrainlij is in fact superfluous. 
This stems from the fact that the union bound in (a) is very loose since the trajectories are all controlled by the same 
output sequence, and from the uniformity in the initial point in Lemma |lL9l In AppendixlA] Lemma Ia31 we show 
that in fact 



P I max 

. l<m<n 



According to (|52] |. this in turn implies that 8"^'^^^^^ in probability without the additional constraint on the 
decay profile. 

The same derivation applies to the positive trajectory, resulting in ^"{^^^^ ""^ 1 in probability. Therefore, for 
any 6 > 0, 

P f +8";"^ , - -8"f , < 1 - 25) < P f-8"f , >(5l + P f +8";'^ , <l-s) = 0(1) 

\^ {l+a)n (1+Q)n J — \ {l+a)n J \ (l+a)n J ^ ' 

and so the posterior probability mass within a $2^{-nR}$ symmetric neighborhood of $\Theta_0$ (up to edge issues) after $(1+\alpha)n$
iterations approaches one in probability as $n \to \infty$. We can therefore find a sequence $\delta_n \to 0$ such that the
probability that this mass exceeds $1 - \delta_n$ tends to one. Using the optimal variable rate decoding rule and setting the
target error probability to $p_e(n) = \delta_n$, we immediately have that $P(R_n < (1+\alpha)^{-1}R) \to 0$. This holds for any
$R < I(X;Y)$, and since $\alpha > 0$ can be arbitrarily small, any rate below the mutual information is achievable.



'*An exponentially decaying r{n) can in fact be guaranteed by requiring the normalized posterior matching kernel to be fixed-point free in a 
somewhat stronger sense than that implied by property (j4[2), which also holds in particular in all the examples considered in this paper. 
To prove achievability using the optimal fixed rate decoding rule, note that any variable-rate rule achieving some
rate $R > 0$ induces a fixed-rate rule achieving an arbitrarily close rate $R - \varepsilon$, by extending the variable-sized decoded
interval into a larger one of a fixed size $2^{-n(R-\varepsilon)}$ whenever the former is smaller, and declaring an error otherwise.
Therefore, any rate $R < I(X;Y)$ is achievable using the optimal fixed rate decoding rule. The fact that the input
constraint is satisfied follows immediately from the SLLN since the marginal invariant distribution for the input is
$P_X$. This concludes the achievability proof for the family $\Omega_A$.
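The reduction from a variable-rate rule to a fixed-rate rule described above can be sketched in a few lines. The helper below is a hypothetical illustration: its name, its signature, the choice to extend the interval symmetrically and then shift it back into the unit interval, and the use of `None` to signal a declared error are all assumptions, since the text only requires that the decoded interval be extended to the fixed size.

```python
def to_fixed_rate_interval(lo, hi, n, rate):
    """Extend a decoded interval (lo, hi) in (0, 1) to the fixed length 2**(-n*rate).

    Returns None (a declared error) when the variable-sized interval is
    already longer than the fixed target length.
    """
    target = 2.0 ** (-n * rate)
    if hi - lo > target:
        return None
    slack = target - (hi - lo)
    # Extend symmetrically, then shift so the result stays inside [0, 1].
    new_lo = lo - slack / 2.0
    new_lo = min(max(new_lo, 0.0), 1.0 - target)
    return (new_lo, new_lo + target)

fixed = to_fixed_rate_interval(0.30, 0.3001, n=10, rate=0.5)
too_wide = to_fixed_rate_interval(0.1, 0.9, n=10, rate=0.5)
print(fixed, too_wide)
```

Because the extended interval always contains the original one, a correct variable-rate decision stays correct, and errors are added only in the event that the variable-rate interval was too long.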

Extending the proof to the family $\Omega_B$ requires reproving Lemma V.3 and a variation of Lemma V.4 where the
ergodicity property is replaced with the maximality property. This is done via the ergodic decomposition [14]
for the associated stationary Markov chain. The proof appears in Appendix A, Lemma A.4. Achievability
for the family $\Omega_C$ has already been established, since $\Omega_C \subset \Omega_A$. The stronger pointwise achievability statement for
$\Omega_C$ is obtained via p.h.r. properties of the associated Markov chain, by essentially showing that Lemmas V.2, V.3
and V.4 hold given any fixed message point. The proof appears in Appendix B, Lemma B.1.

Remark V.5. For $(P_X, P_{Y|X}) \in \Omega_B \setminus \Omega_A$, although the unconstrained capacity $C(P_{Y|X})$ is achieved, there is no
guarantee on the sample path behavior of the input, which may generally differ from the expected behavior dictated
by $P_X$, and depend on the ergodic component the chain lies in. However, if $P_X$ is the unique input distribution such
that $I(X;Y) = C(P_{Y|X})$, then the sample path behavior will nevertheless follow $P_X$ independent of the ergodic
component. This is made precise in Appendix A, Lemma A.4.



VI Error Probability Analysis

In this section, we provide two sufficient conditions on the target error probability facilitating the achievability of a
given rate using the corresponding optimal variable rate decoding rule. The approach here is substantially different
from that of the previous subsection, and the derivations are much simpler. However, the obtained result is applicable
only to rates below some thresholds $R^*, R^\dagger$. Unfortunately, it is currently unknown under what conditions these
thresholds equal the mutual information, rendering the previous section indispensable.

Loosely speaking, the basic idea is the following. After having observed $\Phi^n$, say, the receiver has some estimate
$\theta_{n+1}$ for the next input $\Theta_{n+1}$. Then $(\theta_{n+1}, \Phi^n)$ correspond to a unique estimate $\theta_0$ of the message point which is
recovered by reversing the transmission scheme, i.e., running a RIFS over (0, 1) generated by the kernel $\omega_\phi(\cdot) = F^{-1}_{\Theta|\Phi}(\cdot|\phi)$
(the functional inverse of the normalized posterior matching kernel), controlled by the output sequence
$\Phi^n$, and initialized at $\theta_{n+1}$. In practice however, the receiver decodes an interval and therefore to attain a specific
target error probability $p_e(n)$, one can tentatively decode a subinterval of (0, 1) in which $\Theta_{n+1}$ lies with probability
$1 - p_e(n)$, which since $\Theta_{n+1} \sim \mathcal{U}$, is any interval of length $1 - p_e(n)$. The endpoints of this interval are then
"rolled back" via the RIFS to recover the decoded interval w.r.t. the message point $\Theta_0$. The target error probability
decay which facilitates the achievability of a given rate is determined by the convergence rate of the RIFS, which
also corresponds to the maximal information rate supported by this analysis.

This general principle, relating rate and error probability to the convergence properties of the corresponding RIFS,
facilitates the use of any RIFS contraction condition for convergence. The only limitation stems from the fact that
the kernel $\omega_\phi(\cdot)$ generating the RIFS is an inverse c.d.f. over the unit interval and hence never globally contractive, so only
contraction on the average conditions can be used. The theorems appearing below make use of the principle above
in conjunction with the contraction lemmas mentioned in Section II to obtain two different expressions tying
error probabilities, rate and transmission period. The discussion above is made precise in the course of the proofs.

Denote the family of all continuous functions $\rho : (0,1) \mapsto [1,\infty)$ by $\mathcal{C}$.

Theorem VI.1. Let $(P_X, P_{Y|X})$ be an input/channel pair, $(P_\Theta, P_{\Phi|\Theta})$ the corresponding normalized pair, and let



= For any pe C, defii 



me 




where Ds{-) is defined in and let 



R^ = supi?t(p) 



Uniqueness of the capacity achieving distribution for $P_{Y|X}$ does not generally imply the same for the corresponding normalized channel
$P_{\Phi|\Theta}$. For example, the normalized channel for a BSC/Bernoulli pair has an uncountably infinite number of capacity achieving distributions.
If $R^\dagger > 0$, then the posterior matching scheme with an optimal variable rate decoding rule achieves any rate
$R < R^\dagger$, by setting the target error probability to satisfy $p_e(n) \to 0$ under the constraint

$$\Psi\big((1-\alpha)p_e(n),\ 1-\alpha p_e(n),\ 2^{-R^\dagger(\rho)}\big) = o\big(2^{n(R^\dagger(\rho)-R)}\big) \qquad (53)$$

for some $\alpha \in (0,1)$, and some $\rho \in \mathcal{C}$ such that $R^\dagger(\rho) > R$, where $\Psi$ is defined in Lemma II.11.

Proof. Let $S_n(s)$ be the RIFS generated by $\omega_\phi(\cdot) = F^{-1}_{\Theta|\Phi}(\cdot|\phi)$ and the control sequence $\{\Phi_k\}_{k=1}^{\infty}$, initialized at
$s \in (0,1)$. Select a fixed interval $J_1 = (s,t) \subset (0,1)$ as the decoded interval w.r.t. $\Theta_{n+1}$. Since $\Theta_{n+1} \sim \mathcal{U}$, we
have that

$$P(\Theta_{n+1} \in J_1) = |J_1|$$

Define the corresponding interval at the origin to be

$$J_n = (S_n(s), S_n(t))$$

and set it to be the decoded interval, i.e., $\Delta_n(\Phi^n) = J_n$. Note that the endpoints of $J_n$ are r.v.'s. Since $F_{\Theta|\Phi}(\cdot|\phi)$ is
invertible for any $\phi$, the interval $J_n$ corresponds to $\Theta_1$, namely,

$$P(\Theta_1 \in J_n) = E\,P(\Theta_1 \in J_n \mid \Phi^n) = E\,P(\Theta_{n+1} \in J_1 \mid \Phi^n) = P(\Theta_{n+1} \in J_1) = |J_1|$$

and then in particular (recall that $\Theta_0 = \Theta_1$)

$$p_e(n) = P(\Theta_0 \notin \Delta_n(\Phi^n)) = 1 - |J_1|$$

For a variable rate decoding rule, the target error probability is set in advance. Therefore, given $p_e(n)$ the length
of the interval $J_1$ is constrained to be $|J_1| = 1 - p_e(n)$, and so without loss of generality we can parameterize the
endpoints of $J_1$ by

$$(s,t) = \big((1-\alpha)p_e(n),\ 1-\alpha p_e(n)\big)$$

for some $\alpha \in (0,1)$.

Now let p E C, and define 

r(p) = sup E<^ — Ds[uj,i,] 



se(o.i) 



Note that the expectation above is taken w.r.t. $\Phi \sim \mathcal{U}$. Using Lemma II.11, if $r(\rho) < 1$ then

$$P(|J_n| > \varepsilon) = P(|S_n(s) - S_n(t)| > \varepsilon) \le \varepsilon^{-1}\,\Psi(s,t,r(\rho))\, r^n(\rho)$$

To find the probability that the decoded interval is larger than $2^{-nR}$, we substitute $\varepsilon = 2^{-nR}$ and obtain

$$P(R_n < R) = P(|J_n| > 2^{-nR}) \le 2^{nR}\cdot\Psi\big((1-\alpha)p_e(n),\ 1-\alpha p_e(n),\ r(\rho)\big)\cdot 2^{n\log r(\rho)}$$

Following the above and defining $R^\dagger(\rho) = -\log r(\rho)$, a sufficient condition for $P(R_n < R) \to 0$ for $R < R^\dagger(\rho)$
is given by (53). The proof is concluded by taking the supremum over $\rho \in \mathcal{C}$, and noting that if no $\rho$ results in a
contraction then $R^\dagger \le 0$. □

Theorem VI.1 is very general in the sense of not imposing any constraints on the input/channel pair. It is however
rather difficult to identify a weight function $\rho$ that will result in $R^\dagger(\rho) > 0$. Our next error probability result is less
general (e.g., does not apply to discrete alphabets), yet is much easier to work with. Although it also involves an
optimization step over a set of functions, it is usually easier to find a function which results in a positive rate, as the
examples that follow demonstrate.

The basic idea is similar, only now we essentially work with the original chain and so the RIFS evolves over
$\mathrm{supp}(X)$, generated by the kernel $\omega_y(\cdot) = F^{-1}_{X|Y}(\cdot|y) \circ F_X$ (the functional inverse of the posterior matching kernel),
and controlled by the i.i.d. output sequence $\{Y_k\}_{k=1}^{\infty}$. To state the result we need some definitions first. Let $P_X$ be
some input distribution and let $\rho : \mathrm{supp}(X) \mapsto (a,b)$ be differentiable and monotonically increasing ($a, b$ may be
infinite). The family of all such functions $\rho$ for which $f_{\rho(X)}$ is bounded is denoted by $\mathcal{F}(X)$. Furthermore, for a
proper r.v. $X$ with a support over a (possibly infinite) interval, we define the tail function $T_X : \mathbb{R}^+ \mapsto [0,1]$ to be

$$T_X(\ell) = 1 - \sup\{P_X((x, x+\ell)) : x \in \mathbb{R}\}$$

Namely, $T_X(\ell)$ is the minimal probability that can be assigned by $P_X$ outside an open interval of length $\ell$.
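To make the tail function concrete, the sketch below estimates $T_X(\ell)$ by a grid search over the left endpoint $x$. It is only an illustration: the unit-mean exponential input, the search range, the grid resolution, and the function names are assumptions. For that input the supremum is attained at $x = 0$, so one expects $T_X(\ell) = e^{-\ell}$.

```python
import math

def exponential_cdf(x):
    """C.d.f. of a unit-mean exponential r.v. (assumed input for this sketch)."""
    return 1.0 - math.exp(-x) if x > 0.0 else 0.0

def tail_function(cdf, length, lo, hi, steps=20_000):
    """Grid-search estimate of T_X(length) = 1 - sup_x P_X((x, x + length))."""
    best = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        # For a continuous c.d.f. the open-interval mass is F(x + length) - F(x).
        best = max(best, cdf(x + length) - cdf(x))
    return 1.0 - best

tail_at_one = tail_function(exponential_cdf, 1.0, 0.0, 20.0)
print(tail_at_one, math.exp(-1.0))
```

A grid search of this kind only lower-bounds the supremum in general, so the estimate of $T_X$ is an upper bound that becomes tight when the maximizing endpoint lies on the grid, as it does here.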
Theorem VI.2. Let $(P_X, P_{Y|X})$ be an input/channel pair with $f_{XY}$ continuous over $\mathrm{supp}(X,Y)$, and let $\omega_y(\cdot) =
F^{-1}_{X|Y}(\cdot|y) \circ F_X$. For any $\rho \in \mathcal{F}(X)$, define

$$R^*(\rho) \triangleq \lim_{q\to 0^+}\ \inf_{s\neq t\,\in\,\mathrm{range}(\rho)} \left(-q^{-1}\log E\big[D_{s,t}(\rho\circ\omega_Y\circ\rho^{-1})\big]^q\right)$$

and let

$$R^* \triangleq \sup_{\rho\in\mathcal{F}(X)} R^*(\rho)$$

The following statements hold:

(i) The posterior matching scheme with an optimal variable rate decoding rule achieves any rate $R < R^*$, by
setting the target error probability to satisfy $p_e(n) \to 0$ under the constraint

$$p_e(n) = T_{\rho(X)}\Big(o\big(2^{n(R^*(\rho)-R)}\big)\Big) \qquad (54)$$

for some $\rho \in \mathcal{F}(X)$ satisfying $R^*(\rho) > R$.

(ii) If $|\mathrm{range}(\rho)| < \infty$ then any rate $R < R^*(\rho)$ can be achieved with zero error probability.

(iii) If it is possible to write $\rho\circ\omega_y\circ\rho^{-1}(s) = u(s)v(y) + q(y)$, then

$$R^*(\rho) = -E\log|v(Y)| - \log\sup_{s\in\mathrm{range}(\rho)}|u'(s)|$$

whenever the right-hand side exists.

Proof. We first prove the three statements in the special case where $P_X$ has a support over a (possibly infinite)
interval, and considering only the identity function $\rho_1 : \mathrm{supp}(X) \mapsto \mathrm{supp}(X)$ over the support, i.e., discussing the
achievability of $R^*(\rho_1)$ exclusively. We therefore implicitly assume here that $\rho_1 \in \mathcal{F}(X)$. Let $S_n(s)$ be the RIFS
generated by $\omega_y(\cdot) = F^{-1}_{X|Y}(\cdot|y) \circ F_X$ and the control sequence $\{Y_k\}_{k=1}^{\infty}$, which evolves over the space $\mathrm{supp}(X)$.
Select a fixed interval $J_1 = (s,t) \subset \mathrm{supp}(X)$ as the decoded interval w.r.t. $X_{n+1}$. Since $X_{n+1} \sim P_X$ we have that

$$P(X_{n+1} \in J_1) = P_X(J_1)$$

Define the corresponding interval at the origin to be

$$J_n = (S_n(s), S_n(t))$$

and following the same lines as in the proof of the preceding theorem, $J_n$ is set to be the decoded interval w.r.t.
$X_1 = F_X^{-1}(\Theta_0)$, and so the decoded interval for $\Theta_0$ is set to be $\Delta_n(Y^n) = F_X(J_n)$. Thus,

$$p_e(n) = P(\Theta_0 \notin F_X(J_n)) = P(X_1 \notin J_n) = 1 - P_X(J_1)$$

For any $q > 0$ define

$$r_q = \sup_{s\neq t\,\in\,\mathrm{supp}(X)} E\big[D_{s,t}(\omega_Y)\big]^q$$

Using Jensen's inequality we have that for any $0 < q < p$

$$r_q = \sup_{s\neq t} E\big[D_{s,t}(\omega_Y)\big]^q = \sup_{s\neq t} E\Big(\big[D_{s,t}(\omega_Y)\big]^p\Big)^{q/p} \le \sup_{s\neq t}\Big(E\big[D_{s,t}(\omega_Y)\big]^p\Big)^{q/p} = \Big(\sup_{s\neq t} E\big[D_{s,t}(\omega_Y)\big]^p\Big)^{q/p} = (r_p)^{q/p} \qquad (55)$$

Now suppose there exists some $q^* > 0$ so that $r_{q^*} < 1$. Using (55) we conclude that $r_q < 1$ for any $0 < q < q^*$,
and using Lemma II.10 we have that for any $0 < q < q^*$ and any $\varepsilon > 0$

$$P(|S_n(s) - S_n(t)| > \varepsilon) \le \varepsilon^{-q}\,|s-t|^q\, r_q^n$$



^"This is not a standard zero-error achievability claim, since the rate is generally random. If a fixed rate must be guaranteed, then the error 
probability will be equal to the probability of "outage", i.e., the probability that the variable decoding rate falls below the rate threshold. 






and thus

$$P(R_n < R) = P\big(P_X((S_n(s), S_n(t))) > 2^{-nR}\big) \le P\big(M\cdot|J_n| > 2^{-nR}\big) \le M^q\, 2^{nRq}\, |J_1|^q\, r_q^n$$

where $M = \sup_x f_X(x)$. A sufficient condition for $P(R_n < R) \to 0$ is given by

$$|J_1| = o\big(2^{n(q^{-1}\log r_q^{-1} - R)}\big)$$

Since the above depends only on the length of $J_1$, we can optimize over its position to obtain $p_e(n) = 1 - P_X(J_1) =
T_X(|J_1|)$, or arbitrarily close to that. Therefore, any rate $R < q^{-1}\log r_q^{-1}$ is achievable by setting $p_e(n)$ under
the constraint

$$p_e(n) = T_X(|J_1|) = T_X\Big(o\big(2^{n(q^{-1}\log r_q^{-1} - R)}\big)\Big)$$

We would now like to maximize the term $q^{-1}\log r_q^{-1}$ over the selection of $0 < q < q^*$. Using (55) we obtain, for any $0 < q < p$,

$$q^{-1}\log(r_q)^{-1} \ge q^{-1}\log(r_p)^{-q/p} = p^{-1}\log(r_p)^{-1}$$

and so $q^{-1}\log r_q^{-1}$ is nonincreasing with $q$, thus

$$\sup_{0<q<q^*} q^{-1}\log r_q^{-1} = \lim_{q\to 0^+} q^{-1}\log r_q^{-1} = R^*(\rho_1)$$

where $\rho_1$ is the identity function over $\mathrm{supp}(X)$. From the discussion above it is easily verified that $R^*(\rho_1) > 0$
iff $r_{q^*} < 1$ for some $q^* > 0$. Moreover, if $|\mathrm{supp}(X)| = M_0 < \infty$ then $T_X(\ell) = 0$ for any $\ell > M_0$, therefore
in this case $p_e(n) = 0$ for any $n$ large enough. Note that since $\rho_1$ is defined only over $\mathrm{supp}(X)$, we have that
$|\mathrm{range}(\rho_1)| = |\mathrm{supp}(X)| = M_0$. Thus, statements (i) and (ii) are established for an input distribution with support
over an interval, and the specific selection of the identity function $\rho = \rho_1$.

As for statement ( HiH i. note first that since fxY is continuous then ujy{s) is jointly differentiable in y, s. Suppose 
that ujy{s) — u{s)v{y) + q(y), and so u, v, q are all differentiable. In this separable case we have 



Ht)-u{s)\Y 



s^t \ \t-s\ J 



u{t) — u{s) 



E|v(y)|« •sup|u'(s)|'? 



and 

q-Hogr-^ = -g"MogE|w(y)|« - sup log |w'(s)| < -Elog|w(r)| - sup log |w'(s)| 

s s 

where we have used Jensen's inequality in the last inequality. We now show that the limit of the left-hand side above as $q \to 0^+$ in fact attains the right-hand-side bound (assuming it exists), which is similar to the derivation of the Shannon entropy as a limit of Rényi entropies. Since $E\log|v(Y)|$ is assumed to exist then we have $\log E|v(Y)|^q \to 0$ as $q \to 0^+$, and so to take the limit we need to use L'Hospital's rule. To that end, for any $0 < q < q^*$

$$\frac{d}{dq}E|v(Y)|^q = \frac{d}{dq}\int f_Y(y)|v(y)|^q\,dy = \int f_Y(y)\frac{d}{dq}|v(y)|^q\,dy = \frac{1}{\log e}\int f_Y(y)|v(y)|^q\log|v(y)|\,dy = \frac{1}{\log e}E\left(|v(Y)|^q\log|v(Y)|\right)$$



and thus

$$R^*(\rho_1) = \lim_{q\to 0^+}\left(-q^{-1}\log E|v(Y)|^q - \sup_s\log|u'(s)|\right) = \lim_{q\to 0^+}\left(-\frac{d}{dq}\log E|v(Y)|^q\right) - \sup_s\log|u'(s)|$$

$$= -\lim_{q\to 0^+}\left(\frac{E\left(|v(Y)|^q\log|v(Y)|\right)}{E|v(Y)|^q}\right) - \sup_s\log|u'(s)| = -E\log|v(Y)| - \sup_s\log|u'(s)|$$

which establishes statement (iii) in the special case under discussion. The derivations above all hold under the assumption that the right-hand side above exists.
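The limiting argument above can be checked numerically. The following is only an illustrative sketch: the three-point distribution standing in for $|v(Y)|$ is a hypothetical choice (not taken from the paper), and logarithms are taken to base 2. It evaluates $-q^{-1}\log E|v(Y)|^q$ for decreasing $q$ and compares it with the Jensen bound $-E\log|v(Y)|$.

```python
import math

# Hypothetical discrete stand-in for |v(Y)| (an assumption for illustration only).
values = [0.25, 0.5, 2.0]
probs = [0.2, 0.5, 0.3]

def rate_at(q):
    """Evaluate -(1/q) * log2 E|v(Y)|^q for q > 0."""
    moment = sum(p * v ** q for p, v in zip(probs, values))
    return -math.log2(moment) / q

# Right-hand-side bound obtained from Jensen's inequality: -E log2 |v(Y)|.
jensen_bound = -sum(p * math.log2(v) for p, v in zip(probs, values))

qs = [1.0, 0.5, 0.1, 0.01, 0.001]
rates = [rate_at(q) for q in qs]

for q, r in zip(qs, rates):
    print(f"q = {q:<6} -(1/q) log2 E|v|^q = {r:.6f}")
print(f"-E log2 |v(Y)| = {jensen_bound:.6f}")
```

In this toy case the values increase as $q$ decreases and approach the bound from below, in line with the monotonicity in $q$ and the $q \to 0^+$ limit used in the proof.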

Treating the general case is now a simple extension. Consider a general input distribution $P_X$ (with a p.d.f. continuous over its support), and a differentiable and monotonically increasing function $\rho : \mathrm{supp}(X) \mapsto (a, b)$. Let us define a $\rho$-normalized channel $P_{Y^\rho|X^\rho}$ by connecting the operator $\rho^{-1}(\cdot)$ to the channel's input. Let us consider the posterior matching scheme for the $\rho$-normalized input/channel pair $(P_{\rho(X)}, P_{Y^\rho|X^\rho})$. Using the monotonicity of $\rho$, the corresponding input and inverse channel c.d.f.'s are given by

$$F_{\rho(X)} = F_X \circ \rho^{-1}\,, \qquad F_{X^\rho|Y^\rho}(\cdot|y) = F_{X|Y}(\cdot|y) \circ \rho^{-1}$$

The posterior matching kernel is therefore given by

$$F^{-1}_{\rho(X)} \circ F_{X^\rho|Y^\rho}(\cdot|y) = \rho \circ \left(F_X^{-1} \circ F_{X|Y}(\cdot|y)\right) \circ \rho^{-1}$$

and the corresponding RIFS kernel is the functional inverse of the above, i.e.,

$$\left(F^{-1}_{\rho(X)} \circ F_{X^\rho|Y^\rho}(\cdot|y)\right)^{-1} = \rho \circ \left(F^{-1}_{X|Y}(\cdot|y) \circ F_X\right) \circ \rho^{-1} = \rho \circ \omega_y \circ \rho^{-1}$$

Now, using the monotonicity of $\rho$ it is readily verified that the input/channel pairs $(P_X, P_{Y|X})$ and $(P_{\rho(X)}, P_{Y^\rho|X^\rho})$ correspond to the same normalized channel. Hence, the corresponding posterior matching schemes are equivalent, in the sense that $\{(X_k, Y_k)\}_{k=1}^\infty$ and $\{(\rho^{-1}(X_k^\rho), Y_k^\rho)\}_{k=1}^\infty$ have the same joint distribution. Therefore, the preceding analysis holds for the input/channel pair $(P_{\rho(X)}, P_{Y^\rho|X^\rho})$, and the result follows immediately. □

Loosely speaking, the optimization step in both Theorems has a similar task: changing the scale by which distances are measured so that the RIFS kernel appears contractive. In Theorem VI.1 the weight functions multiply the local slope of the RIFS. In Theorem VI.2 the approach is in a sense complementary, since the functions are applied to the RIFS kernel itself, thereby shaping the slopes directly. These functions will therefore be referred to as shaping functions.

Example III.1 (AWGN, continued). Returning to the AWGN channel setting with a Gaussian input, we can now
determine the tradeoff between rate, error probability and transmission period obtained by the Schalkwijk-Kailath 
scheme. Inverting the kernel (l2Tl i we obtain the RIFS kernel 

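$$\omega_y(s) = F^{-1}_{X|Y}(F_X(s)|y) = \frac{s}{\sqrt{1+\mathrm{SNR}}} + \frac{\mathrm{SNR}}{1+\mathrm{SNR}}\,y$$

Setting the identity shaping function $\rho_1(s) = s$, the condition of Theorem VI.2 statement (iii) holds and so

$$R^*(\rho_1) = -\log\sup_s\left|\frac{d}{ds}\,\frac{s}{\sqrt{1+\mathrm{SNR}}}\right| = \frac{1}{2}\log(1+\mathrm{SNR}) = C$$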



so in this case $R^* = C$, and statement (i) reconfirms that the Schalkwijk-Kailath scheme achieves capacity. Using
standard bounds for the Gaussian distribution, the Gaussian tail function (for the input distribution) satisfies 

Plugging the above into (54), we find that a rate $R < C$ is achievable by setting the target error probability to

$$-\log p_e(n) = -\log T_X\left(O\left(2^{n(R^*-R)}\right)\right) = O\left(2^{2n(C-R)}\right)$$

recovering the well known double-exponential behavior. Note that since the interval contraction factor in this case is independent of the output sequence, the variable-rate decoding rule is in fact fixed-rate, hence the same double-exponential performance is obtained using a fixed-rate decoding rule.

We mention here the well known fact that for the AWGN channel, the error probability can be made to decay as a higher order exponential in the block length, via adaptations of the Schalkwijk-Kailath scheme [21], [22]. These adaptations exploit the discreteness of the message set, especially at the last stages of transmission, and are not directly applicable within our framework, since we define error probability in terms of intervals and not discrete messages. They can only be applied to the equivalent standard scheme obtained via Lemma II.3.
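As a small numerical sanity check of the AWGN computation, the sketch below assumes the affine RIFS kernel $\omega_y(s) = s/\sqrt{1+\mathrm{SNR}} + \frac{\mathrm{SNR}}{1+\mathrm{SNR}}\,y$ (obtained by inverting the Schalkwijk-Kailath kernel for a Gaussian input), estimates its slope in $s$ by a finite difference, and compares $-\log_2$ of that slope with $\frac{1}{2}\log_2(1+\mathrm{SNR})$. The function names and the chosen SNR value are illustrative assumptions.

```python
import math

def rifs_kernel(s, y, snr):
    """Assumed RIFS kernel for the AWGN channel with a Gaussian input."""
    return s / math.sqrt(1 + snr) + snr / (1 + snr) * y

def rate_identity_shaping(snr, y=0.7, h=1e-6):
    """-log2 of the kernel slope in s; the kernel is affine, so s and y do not matter."""
    slope = (rifs_kernel(1.0 + h, y, snr) - rifs_kernel(1.0, y, snr)) / h
    return -math.log2(abs(slope))

snr = 15.0                                  # illustrative value
capacity = 0.5 * math.log2(1 + snr)         # AWGN capacity in bits per channel use
r_star = rate_identity_shaping(snr)

print(f"R*(rho_1) = {r_star:.6f} bits, C = {capacity:.6f} bits")
```

Because the slope does not depend on the output $y$, the contraction factor is deterministic, which is the property behind the remark that the variable-rate rule is in fact fixed-rate here.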



Example III.2 (BSC, continued). The conditions of Theorem VI.2 are not satisfied in the BSC setting, and we resort to Theorem VI.1. Inverting the posterior matching kernel (30) pertaining to the corresponding normalized channel, we obtain the RIFS kernel



and 



,se (o,i-p),<^G (o,i) 
( \ F-w J [i-p,i),(/.e (o,i) 



2^ .se (O,i-P),0e (0,i) 
1 [i-p,i),0e (0 i) 



E 



Using a constant weight function (i.e., no weights) does not work in this case, since the average of slopes for (say) $s \in (0, p)$ is

$$\frac{1}{2}\left(\frac{1}{2p} + \frac{1}{2(1-p)}\right) \ge 1$$

In fact, any bounded weight function will result in the same problem for s > small enough, which suggests that the 
weight function should diverge to infinity as s — !• 0. Setting p{s) = for /? > 1 is a good selection for s G (0,p) 
since in that case 

f P(^$(g)) 

which can be made smaller than unity by properly selecting /?. Setting p symmetric around ^ duplicates the above 
to s e (1 — p, 1). However, this selection (and some variants) do not seem to work in the range s £ {p, i), for which 
< 1 is required. Finding a weight function $\rho$ for which $R^*(\rho) > 0$ (if one exists at all) seems to be a difficult task, which we were unable to accomplish thus far.
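The difficulty with a constant weight function can be illustrated numerically. The sketch below rests on assumptions that are not spelled out in the text above: a uniform message point, the normalized BSC in which outputs $\phi < \frac{1}{2}$ and $\phi \ge \frac{1}{2}$ correspond to the two channel output symbols (each with probability $\frac{1}{2}$), crossover probability $0 < p < \frac{1}{2}$, and a RIFS kernel taken as the inverse of the resulting piecewise-linear posterior c.d.f. The function names are hypothetical.

```python
def rifs_kernel_bsc(s, phi, p):
    """Assumed RIFS kernel: inverse of the piecewise-linear posterior c.d.f.

    a is the posterior density on theta in (0, 1/2); 2 - a is the density on (1/2, 1).
    """
    a = 2 * (1 - p) if phi < 0.5 else 2 * p
    if s < a / 2:
        return s / a
    return 0.5 + (s - a / 2) / (2 - a)

def average_slope(s, p, h=1e-7):
    """Average of the kernel slopes at s over the two equiprobable output halves."""
    slopes = []
    for phi in (0.25, 0.75):  # one representative output from each half
        slopes.append((rifs_kernel_bsc(s + h, phi, p) - rifs_kernel_bsc(s, phi, p)) / h)
    return 0.5 * sum(slopes)

p = 0.1
avg = average_slope(0.05, p)  # a point with s in (0, p)
print(f"average slope = {avg:.4f}, 1/(4p(1-p)) = {1 / (4 * p * (1 - p)):.4f}")
```

Under these assumptions the average slope near zero equals $1/(4p(1-p))$, which is at least one for every crossover probability, so the unweighted kernel is not contractive on average there.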

Example III.3 (Uniform input/noise, continued). We have already seen that achieving the mutual information with zero error decoding is possible in the uniform noise/input setting. Let us now derive this fact via Theorem VI.2. The output p.d.f. is given by

$$f_Y(y) = y\,\mathbb{1}_{(0,1]}(y) + (2-y)\,\mathbb{1}_{(1,2)}(y)$$

The RIFS kernel is obtained by inverting the posterior matching kernel (23), which yields

$$\omega_y(s) = F^{-1}_{X|Y}(F_X(s)|y) = s\left(y\,\mathbb{1}_{(0,1]}(y) + (2-y)\,\mathbb{1}_{(1,2)}(y)\right) + (y-1)\,\mathbb{1}_{(1,2)}(y) = s f_Y(y) + (y-1)\,\mathbb{1}_{(1,2)}(y)$$

Using the identity shaping function $\rho_1$ again (but now restricted to $\mathrm{supp}(X) = (0, 1)$), the condition of statement (iii) holds and therefore

$$R^*(\rho_1) = -E\log f_Y(Y) - \sup_{s\in(0,1)}\log 1 = h(Y) = I(X;Y)$$

and we have $R^* = R^*(\rho_1) = I(X;Y)$, thereby verifying once again that the mutual information is achievable. Since $\mathrm{range}(\rho_1) = (0, 1)$ is bounded, statement (ii) reconfirms that variable-rate zero error decoding is possible.
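For concreteness, the value of $h(Y)$ in this example can be evaluated numerically. The sketch below assumes $X$ and $Z$ are both uniform over $(0,1)$, so that $Y = X + Z$ has the triangular density above, and uses a plain midpoint rule (the grid size is an arbitrary choice). The closed form $\frac{1}{2}\log_2 e \approx 0.7213$ bits is used only as a cross-check.

```python
import math

def f_Y(y):
    """Triangular output density for Y = X + Z with X, Z ~ Uniform(0, 1)."""
    if 0 < y <= 1:
        return y
    if 1 < y < 2:
        return 2 - y
    return 0.0

def differential_entropy_bits(pdf, lo, hi, n=200_000):
    """Midpoint-rule estimate of -integral of f * log2(f) over (lo, hi)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        f = pdf(lo + (i + 0.5) * step)
        if f > 0:
            total -= f * math.log2(f) * step
    return total

h_Y = differential_entropy_bits(f_Y, 0.0, 2.0)
closed_form = 0.5 / math.log(2)  # (1/2) log2(e) bits

print(f"h(Y) ~ {h_Y:.5f} bits, closed form {closed_form:.5f} bits")
```

Since $h(Y|X) = h(Z) = 0$ for unit-width uniform noise, this number is also $I(X;Y)$ and hence the rate $R^*(\rho_1)$ of the example.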

Example III.4 (Exponential input/noise, continued). Let us return to the additive noise channel with an exponentially distributed noise and input. We have already seen that the posterior matching scheme (26) achieves the mutual information, which in this case is $I(X;Y) \approx 0.8327$. The p.d.f. of the corresponding output is

$$f_Y(y) = y e^{-y}\,\mathbb{1}_{(0,\infty)}(y)$$

It is easily verified that $F^{-1}_{X|Y}(t|y) = ty$ and so the RIFS kernel is given by

$$\omega_y(s) = F^{-1}_{X|Y}(F_X(s)|y) = y\left(1 - e^{-s}\right)$$

Now, using Theorem VI.2 with the identity shaping function $\rho_1$ restricted to $\mathrm{supp}(X) = (0, \infty)$, the condition of statement (iii) holds and therefore



$$R^*(\rho_1) = -E\log Y - \log\sup_{s\in(0,\infty)}\left|\frac{d}{ds}\left(1 - e^{-s}\right)\right| = -E\log Y \approx -0.61 < 0$$




Thus, the identity function is not a good choice in this case, and we must look for a different shaping function. Let us set $\rho_2(s) = \frac{1}{\sqrt{s}}$, which results in a p.d.f. and c.d.f.

$$f_{\rho_2(X)}(s) = \frac{2}{s^3}\exp\left(-s^{-2}\right)\,, \qquad F_{\rho_2(X)}(s) = \exp\left(-s^{-2}\right)$$

and

$$\rho_2 \circ \omega_y \circ \rho_2^{-1}(s) = \left[y\left(1 - \exp\left(-s^{-2}\right)\right)\right]^{-1/2}$$

Since $f_{\rho_2(X)}$ is bounded and the above again satisfies the condition of statement (iii), we obtain



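$$R^*(\rho_2) = -E\log Y^{-1/2} - \log\sup_{s\in[0,\infty)}\left|\frac{d}{ds}\left(1 - \exp\left(-s^{-2}\right)\right)^{-1/2}\right|$$

$$= \frac{1}{2}E\log Y + \inf_{s\in[0,\infty)}\left[\frac{\log e}{s^2} + 3\log s + \frac{3}{2}\log\left(1 - \exp\left(-s^{-2}\right)\right)\right] = \frac{1}{2}E\log Y \approx 0.305$$

where the infimum above is attained as $s \to \infty$. The tail function of $P_{\rho_2(X)}$ is bounded by

$$T_{\rho_2(X)}(x) \le 1 - \exp\left(-x^{-2}\right) \le x^{-2}$$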

Thus, any rate $R < R^*(\rho_2) \approx 0.305$ is achieved by the posterior matching scheme of Example III.4 using a variable decoding rule if the target error probability is set to

$$\frac{1}{p_e(n)} = O\left(2^{2n(R^*(\rho_2) - R)}\right)$$

and so the following error exponent is achievable:

$$\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{p_e(n)} = 2\left(R^*(\rho_2) - R\right) \approx 0.61 - 2R$$

Although we know from Theorem IV.1 that any rate up to the mutual information is achieved in this case, $\rho_2(\cdot)$ is the best shaping function we have found, and so our error analysis is valid only up to the rate $R^*(\rho_2) \approx 0.305 < I(X;Y)$.
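The numerical values quoted in this example can be reproduced with a few lines of code. The sketch below assumes unit-mean exponential input and noise (so that $Y$ is Gamma distributed with shape 2 and $E\ln Y = 1 - \gamma$, with $\gamma$ the Euler-Mascheroni constant), the shaping function $\rho_2(s) = 1/\sqrt{s}$, and logarithms to base 2. The search grid is an arbitrary illustrative choice.

```python
import math

EULER_GAMMA = 0.5772156649015329

# Y ~ Gamma(shape=2, scale=1): E ln Y = psi(2) = 1 - gamma.
E_log2_Y = (1 - EULER_GAMMA) / math.log(2)

# h(Y) - h(Z) = (1 + gamma) - 1 nats, converted to bits.
mutual_info = EULER_GAMMA / math.log(2)

# Identity shaping function: the slope of 1 - exp(-s) has supremum 1.
r_star_identity = -E_log2_Y

def u_prime(s):
    """Derivative of u(s) = (1 - exp(-s**-2))**-0.5, the s-part of the shaped kernel."""
    t = s ** -2
    return t ** 1.5 * math.exp(-t) / (-math.expm1(-t)) ** 1.5

grid = [10 ** (k / 50) for k in range(-100, 151)]  # s from 0.01 to 1000
sup_slope = max(u_prime(s) for s in grid)
r_star_rho2 = 0.5 * E_log2_Y - math.log2(sup_slope)

print(f"I(X;Y)     ~ {mutual_info:.4f}")
print(f"R*(rho_1)  ~ {r_star_identity:.4f}")
print(f"sup |u'|   ~ {sup_slope:.6f} (approached as s grows)")
print(f"R*(rho_2)  ~ {r_star_rho2:.4f}")
```

On this grid the slope stays below one and approaches it only for large $s$, which matches the statement that the infimum is attained as $s \to \infty$.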



VII Extensions 

A The $\mu$-Variants of the Posterior Matching Scheme

In this subsection we return to discuss the $\mu$-variants (74) of the baseline posterior matching scheme addressed thus far. To understand why these variants are of interest, let us first establish the necessity of a fixed-point free kernel (thereby also proving Lemma IV.2).

Lemma VII.l. If(j^i3i does not hold, then does not hold either and the corresponding scheme cannot achieve 
any positive rate. 

Proof. By the assumption in the Lemma, there must exist some fixed point $\theta_f \in (0, 1)$ such that

$$P\left(F_{\Theta|\Phi}(\theta_f|\Phi) = \theta_f\right) = 1$$

The posterior c.d.f. $F_{\Theta_0|\Phi^n}$ is obtained by an iterated composition of the kernel $F_{\Theta|\Phi}(\theta|\phi)$ controlled by the i.i.d. output sequence $\{\Phi_k\}_{k=1}^n$. Thus, the fixed point at $\theta_f$ induces a fixed point for the posterior c.d.f. at $\theta_f$ as well, since

$$P\left(F_{\Theta_0|\Phi^n}(\theta_f|\Phi^n) = \theta_f\right) \ge \prod_{k=1}^n P\left(F_{\Theta|\Phi}(\theta_f|\Phi_k) = \theta_f\right) = 1$$

This immediately implies that no positive rate can be achieved, since the posterior probability of the interval $(0, \theta_f)$ remains fixed at $\theta_f$. Stated differently, this simply means that the output sequence provides no information regarding whether $\Theta_0 < \theta_f$ or not. For practically the same reason, the invariant distribution $P_{\Theta\Phi}$ for the Markov chain $\{(\Theta_n, \Phi_n)\}_{n=1}^\infty$ is not ergodic, since the set $(0, \theta_f) \times (0, 1)$ is invariant yet $0 < P_{\Theta\Phi}((0, \theta_f) \times (0, 1)) = \theta_f < 1$. □
Suppose our kernel has L fixed points, and so following the above the unit interval can be partitioned into a total
of L + 1 corresponding invariant intervals. One external way to try and handle the fixed-point problem is to decode a 
disjoint union of L + 1 exponentially small intervals (one per invariant interval) in which the message point lies with 
high probability, and then resolve the remaining ambiguity using some simple non-feedback zero-rate code. This 
seems reasonable, yet there are two caveats. First, the maximal achievable rate in an invariant interval may generally 
be smaller than the mutual information, incurring a penalty in rate. Second, the invariant distribution $P_{\Theta\Phi}$ is not
ergodic, and it is likely that any encapsulated input constraints will not be satisfied (i.e., not pathwise but only in 
expectation over invariant intervals). A better idea is to map our message into the invariant interval with the maximal 
achievable rate, which is always at least as high as the mutual information. This corresponds to a posterior matching 
scheme with a different input distribution (using only some of the inputs), and resolves the rate problem, but not the 
input constraint problem. We must therefore look for a different type of solution. 

Fortunately, it turns out that the fixed-point phenomenon is in many cases just an artifact of the specific ordering imposed on the inputs, induced by the selection of the posterior c.d.f. in the posterior matching rule. In many cases, imposing a different ordering can eliminate this artifact altogether. We have already encountered that in the DMC setting (Example III.6 in Section V, using Lemma IV.4), where in the case a fixed point exists, a simple input permutation was shown to be sufficient in order for the posterior matching scheme (matched to the equivalent input/channel pair) to achieve capacity. This permutation can be interpreted as inducing a different order over the inputs, and the scheme for the equivalent pair can be interpreted as a specific $\mu$-variant of the original scheme.

These observations provide motivation to extend the notion of equivalence between input/channel pairs from the discrete case to the general case. Two input/channel pairs $(P_X, P_{Y|X})$ and $(P_{X^*}, P_{Y^*|X^*})$ are said to be equivalent if there exist u.p.f.'s $\mu, \sigma$ such that the corresponding normalized channels satisfy

$$P_{\Phi^*|\Theta^*}(\cdot|\theta) = P_{\Phi|\Theta}(\sigma(\cdot)|\mu(\theta))$$

for any $\theta \in (0, 1)$. This practically means that the asterisked normalized channel is obtained by applying $\mu$ and $\sigma^{-1}$ to the input and output of the asterisk-free normalized channel, respectively, and in this case we also say that the pair $(P_X, P_{Y|X})$ is $\mu$-related to the pair $(P_{X^*}, P_{Y^*|X^*})$. Again, equivalent input/channel pairs have the same mutual information. Following this, for every u.p.f. $\mu$ and every set of input/channel pairs $\Gamma$, we define $\mu(\Gamma)$ to be the set of all input/channel pairs to which some pair in $\Gamma$ is $\mu$-related. The following result follows through immediately from the developments in Sections III and V, and the discussion above.

Theorem VII.1. For any input/channel pair $(P_X, P_{Y|X})$ and any u.p.f. $\mu$, the corresponding $\mu$-variant posterior matching scheme (74) has the following properties:

(i) It admits a recursive representation w.r.t. the normalized channel, with a kernel $\mu \circ F_{\mu^{-1}(\Theta)|\Phi}(\cdot|\phi) \circ \mu^{-1}$, i.e.,

$$\Theta_1 = \mu(\Theta_0)\,, \qquad \Theta_{n+1} = \mu \circ F_{\mu^{-1}(\Theta)|\Phi}(\cdot|\Phi_n) \circ \mu^{-1}(\Theta_n)$$

(ii) If $(P_X, P_{Y|X}) \in \mu(\Omega_A \cup \Omega_B)$ (resp. $\mu(\Omega_C)$), the scheme achieves (resp. pointwise achieves) any rate $R < I(X;Y)$ over the channel $P_{Y|X}$. Furthermore, if $(P_X, P_{Y|X}) \in \mu(\Omega_A \cup \Omega_C)$ then this is achieved within an input constraint $(\eta, E\eta(X))$, for any measurable $\eta : \mathcal{X} \mapsto \mathbb{R}$ satisfying $E|\eta(X)| < \infty$.

Theorem VII.1 expands the set of input/channel pairs for which some variant of the posterior matching scheme achieves the mutual information, by allowing different orderings of the inputs to eliminate the fixed-point phenomenon. For the DMC case, we have already seen that considering $\mu$-variants is sometimes crucial for achieving capacity. Next we describe perhaps a more lucid (although very synthetic) example, making the same point for continuous alphabets.

Example VII.1. Let the memoryless channel $P_{Y|X}$ be defined by the following input to output relation:

$$Y = X^2 + Z$$

where the noise $Z$ is statistically independent of the input $X$. Suppose that some input constraints are imposed so that the capacity is finite, and also such that the capacity achieving distribution does not have a mass point at zero. Now assume that an input zero mean constraint is additionally imposed. It is easy to see that the capacity achieving distribution $P_X$ is now symmetric around zero, i.e., $P_X((-\infty, 0)) = P_X((0, \infty)) = \frac{1}{2}$. It is immediately clear that the output of the channel provides no information regarding the sign of the input, hence the corresponding posterior matching kernel $F_X^{-1} \circ F_{X|Y}(\cdot|y)$ has a fixed point at the origin, and equivalently, the normalized kernel $F_{\Theta|\Phi}(\cdot|\phi)$ has a fixed point at $\theta = \frac{1}{2}$. Thus, by Lemma VII.1 the scheme cannot attain any positive rate. Intuitively, this stems from the fact that information has been coded in the sign of the input, or the most-significant-bit of the message point, which cannot be recovered. To circumvent this problem we can change the ordering of the input, which is effectively achieved by using one of the $\mu$-variants of the posterior matching scheme. For example, set

{ 6 + \ (0,i] 
I 0e(|,l) 

and use the corresponding $\mu$-variant scheme. This maintains the same input distribution while breaking the symmetry around $\frac{1}{2}$, and eliminating the fixed-point phenomenon. This $\mu$-variant scheme can therefore achieve the mutual information, assuming all the other conditions are satisfied.



B Channel Model Mismatch 



In this subsection we discuss the model mismatch case, where the scheme is designed according to the wrong channel model. We assume that the transmitter and receiver are both unaware of the situation, or at least do not take advantage of it. To that end, for any pair $(P_X, P_{Y|X}) \in \Omega_C$ we define a mismatch set $\Omega_C^{mis}(P_X, P_{Y|X})$ consisting of all input/channel pairs $(P_{X^*}, P_{Y^*|X^*})$, with a corresponding normalized channel $P_{\Phi^*|\Theta^*}$, that admit the following properties:



(CI) {Px*,Py'\X') satisfies (45i, and M [i5(P<j,. |e. irP||e I ^^e) + |e* ll+^'lie I ^e)J < oo. 

(C2) $D(P_{Y^*|X^*} \| P_{Y|X} | P_{X^*}) < \infty$.

(C3) $F_X^{-1}(F_{X|Y}(X^*|Y^*)) \sim P_{X^*}$.

(C4) Let {Y*}'^^i be the channel output sequence when the posterior matching scheme for {Px, Py\x) '■s used 
over Py*\x* ond initialized with Xi ^ Px' ■ There is a contraction ^ and a length function tpx over ^c, such 
that for every h £ cind n G M, 



supe(^, [FxiYi-\Y:)oF^'oh] 



(C5) Let $Z = F_X^{-1}(F_{X|Y}(X^*|Y^*))$. For any $x^* \in \mathrm{supp}(X^*)$ the set $\mathrm{supp}(Z|X^* = x^*)$ contains some open neighborhood of $x^*$.

The properties above are not too difficult to verify, with the notable exception of the contraction condition (C4), which is not "single letter". This stems from the fact that the output distribution under mismatch is generally not i.i.d. Clearly, for any $(P_X, P_{Y|X}) \in \Omega_C$ we have $(P_X, P_{Y|X}) \in \Omega_C^{mis}(P_X, P_{Y|X})$ in particular. Moreover, if the posterior matching kernels for the pairs $(P_X, P_{Y|X})$ and $(P_{X^*}, P_{Y^*|X^*})$ happen to coincide, then we trivially have $(P_{X^*}, P_{Y^*|X^*}) \in \Omega_C^{mis}(P_X, P_{Y|X})$ and any rate $R < I(X^*;Y^*) = I(X;Y)$ is pointwise achievable, hence there is no rate loss due to mismatch (although satisfaction of input constraints may be affected, see below). Note that the initialization step (i.e., transforming the message point into the first channel input) is in general different even when the kernels coincide. Nevertheless, identical kernels imply a common input support and so using a different initialization amounts to a one-to-one transformation of the message point, which poses no problem due to pointwise achievability.

The channel model mismatch does incur a rate loss in general, as quantified in the following Theorem. 

Theorem VII.2 (Mismatch Achievability). Let $(P_X, P_{Y|X}) \in \Omega_C$, and suppose the corresponding posterior matching scheme (16) is used over a channel $P_{Y^*|X^*}$ (unknown on both terminals). If there exists an input distribution $P_{X^*}$ such that $(P_{X^*}, P_{Y^*|X^*}) \in \Omega_C^{mis}(P_X, P_{Y|X})$, then $P_{X^*}$ is unique and the mismatched scheme with a fixed/variable rate optimal decoding rule matched to $(P_X, P_{Y|X})$ pointwise achieves any rate

$$R < I(X^*;Y^*) - \left[D(P_{Y^*|X^*} \| P_{Y|X} | P_{X^*}) - D(P_{Y^*} \| P_Y)\right] \qquad (56)$$

within an input constraint $(\eta, E\eta(X^*))$ provided that $E|\eta(X^*)| < \infty$.

Proof. See Appendix B. □

The difference between relative entropies in (56) constitutes the penalty in rate due to the mismatch, relative to what could have been achieved for the induced input distribution $P_{X^*}$. Note that this term is always nonnegative due to the convexity of the relative entropy, and vanishes when there is no mismatch.

For the next example we need the following Lemma. The proof (by direct calculation) is left out. 

Lemma VII.2. Let $U, V$ be a pair of continuous, zero mean, finite variance r.v.'s, and suppose $V$ is Gaussian. Then

$$D(P_U \| P_V) = h(V) - h(U) + \frac{\log e}{2}\left(\frac{EU^2}{EV^2} - 1\right)$$
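Since the proof of the Lemma is omitted, a quick numerical check may be useful. The sketch below takes $U$ uniform over $(-a, a)$ and $V$ zero-mean Gaussian with variance `sigma2` (both parameter values are arbitrary illustrative choices), evaluates the relative entropy by direct midpoint-rule integration, and compares it with the expression $h(V) - h(U) + \frac{\log e}{2}\left(\frac{EU^2}{EV^2} - 1\right)$, all in bits.

```python
import math

def check_lemma(a=1.5, sigma2=2.0, n=200_000):
    """Return (direct, formula) for D(P_U || P_V) in bits.

    U ~ Uniform(-a, a), V ~ N(0, sigma2); a and sigma2 are illustrative choices.
    """
    f_u = 1 / (2 * a)
    step = 2 * a / n
    direct = 0.0
    for i in range(n):
        x = -a + (i + 0.5) * step
        f_v = math.exp(-x * x / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
        direct += f_u * math.log2(f_u / f_v) * step

    h_u = math.log2(2 * a)                                   # entropy of the uniform r.v.
    h_v = 0.5 * math.log2(2 * math.pi * math.e * sigma2)     # entropy of the Gaussian r.v.
    var_u = a * a / 3
    formula = h_v - h_u + 0.5 * math.log2(math.e) * (var_u / sigma2 - 1)
    return direct, formula

direct, formula = check_lemma()
print(f"direct integration: {direct:.6f} bits, Lemma expression: {formula:.6f} bits")
```

The agreement reflects the fact that $-E\log f_V(U)$ depends on $U$ only through its second moment when $V$ is a zero-mean Gaussian.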

Example VII.2 (Robustness of the Schalkwijk-Kailath scheme). Suppose that the Schalkwijk-Kailath scheme designed for an AWGN channel $P_{Y|X}$ with noise $Z \sim \mathcal{N}(0, N)$ and input $X \sim \mathcal{N}(0, P)$, is used over an AWGN channel with noise variance $N^*$. Since the scheme depends on the channel and input only through the $\mathrm{SNR} = \frac{P}{N}$, the scheme's kernel coincides with the Schalkwijk-Kailath kernel for an input $X^* \sim \mathcal{N}(0, \frac{N^*}{N}P)$ over the mismatch channel. Therefore, following the remark preceding Theorem VII.2 there is no rate loss, and the input power is automatically scaled to maintain the same SNR for which the scheme was designed. This robustness of the Schalkwijk-Kailath scheme to changes in the Gaussian noise (SNR mismatch) was already mentioned in [23].

However, Theorem VII.2 can be used to demonstrate how the Schalkwijk-Kailath scheme is robust to more general perturbations in the noise statistics. Suppose the scheme is used over a generally non-Gaussian additive noise channel $P_{Y^*|X^*}$ with noise $Z^*$ having zero mean and a variance $N^*$. Suppose there exists an input distribution $P_{X^*}$ such that $(P_{X^*}, P_{Y^*|X^*}) \in \Omega_C^{mis}(P_X, P_{Y|X})$. We have $Y = X + Z$ and $Y^* = X^* + Z^*$ for the original channel and the mismatch channel respectively. Plugging (22) into the invariance property (C3) and looking at the variance, we have

$$P^* = E\left(\frac{X^*}{\sqrt{1+\mathrm{SNR}}} - \frac{\mathrm{SNR}\cdot Z^*}{\sqrt{1+\mathrm{SNR}}}\right)^2 = \frac{P^* + \mathrm{SNR}^2\cdot N^*}{1+\mathrm{SNR}}$$

which immediately results in $\mathrm{SNR}^* \triangleq \frac{P^*}{N^*} = \mathrm{SNR}$, so the SNR is conserved despite the mismatch. Now applying Theorem VII.2 and some simple manipulations, we find that the mismatched scheme pointwise achieves any rate $R$ satisfying



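$$R < h(Y^*) - h(Z^*) - \left(D(P_{Z^*}\|P_Z) - D(P_{Y^*}\|P_Y)\right)$$

$$= h(Y^*) - h(Z^*) - \left[h(Z) - h(Z^*) - h(Y) + h(Y^*) + \frac{\log e}{2}\left(\frac{E(Z^*)^2}{EZ^2} - \frac{E(Y^*)^2}{EY^2}\right)\right]$$

$$= h(Y) - h(Z) + \frac{\log e}{2}\left(\frac{P^* + N^*}{P + N} - \frac{N^*}{N}\right) = I(X;Y) + \frac{\log e}{2}\cdot\frac{N^*}{N}\left(\frac{1+\mathrm{SNR}^*}{1+\mathrm{SNR}} - 1\right)$$

$$= I(X;Y) = \frac{1}{2}\log(1+\mathrm{SNR})$$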

where we have used Lemma VII.2 in the first equality. Therefore, the mismatched scheme can attain any rate below the Gaussian capacity it was designed for, despite the fact that the noise is not Gaussian, and the input power is automatically scaled to maintain the same SNR for which the scheme was designed. Invoking [24], we can now claim that the Schalkwijk-Kailath scheme is universal for communication over a memoryless additive noise channel (within the mismatch set) with a given variance and an input power constraint, in the sense of losing at most half a bit in rate w.r.t. the channel capacity.
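The automatic power scaling can be illustrated with a short iteration. The sketch below assumes that the Schalkwijk-Kailath recursion over the mismatched additive channel reduces to $X_{n+1} = (X_n - \mathrm{SNR}\cdot Z_n^*)/\sqrt{1+\mathrm{SNR}}$ with zero-mean noise independent of the current input, so that the input power obeys $P_{n+1} = (P_n + \mathrm{SNR}^2 N^*)/(1+\mathrm{SNR})$. The function name and the numerical values are illustrative assumptions.

```python
def input_power_trajectory(p0, snr, noise_var, steps):
    """Iterate P_{n+1} = (P_n + snr^2 * noise_var) / (1 + snr), starting from p0."""
    powers = [p0]
    for _ in range(steps):
        powers.append((powers[-1] + snr ** 2 * noise_var) / (1 + snr))
    return powers

designed_snr = 3.0     # SNR = P / N the scheme was designed for
actual_noise = 2.0     # N*, variance of the actual (possibly non-Gaussian) noise
powers = input_power_trajectory(p0=10.0, snr=designed_snr, noise_var=actual_noise, steps=40)

print(f"limiting input power: {powers[-1]:.6f}")
print(f"effective SNR:        {powers[-1] / actual_noise:.6f}")
```

Only the noise variance enters the recursion, so under these assumptions the input power settles at $\mathrm{SNR}\cdot N^*$ and the effective SNR returns to the designed value regardless of the noise distribution.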



VIII Discussion 

An explicit feedback transmission scheme tailored to any memoryless channel and any input distribution was devel- 
oped, based on a novel principle of posterior matching. In particular, this scheme was shown to provide a unified 
view of the well known Horstein and Schalkwijk-Kailath schemes. The core of the transmission strategy lies in 
the constantly refined representation of the message point's position relative to the uncertainty at the receiver. This 
is accomplished by evaluating the receiver's posterior c.d.f. at the message point, followed by a technical step of 
matching this quantity to the channel via an appropriate transformation. A recursive representation of the scheme 
renders it very simple to implement, as the next channel input is a fixed function of the previous input/output pair only. This function is explicitly given in terms of the channel and the selected input distribution. The posterior
matching scheme was shown to achieve the mutual information for pairs of channels and input distributions under 
very general conditions. This was obtained by proving a concentration result of the posterior p.d.f. around the 
message point, in conjunction with a contraction result for the posterior c.d.f. over a suitable function space. In 
particular, achievability was established for discrete memoryless channels, thereby also proving that the Horstein 
scheme is capacity achieving. 

The error probability performance of the scheme was analyzed, by casting the variable-rate decoding process 
as the evolution of a reversed iterated function system (RIFS), and interpreting the associated contraction factors as 
information rates. This approach yielded two closed form expressions for the exponential decay of the target error 
probability which facilitates the achievability of a given rate, then used to provide explicit results in several examples. 
However, the presented error analysis is preliminary and should be further pursued. First, the obtained expressions 
require searching for good weight or shaping functions, which in many cases may be a difficult task. In the same 
vein, it is yet unclear under what conditions the error analysis becomes valid for rates up to the mutual information. 
Finally, the basic technique is quite general and allows for other RIFS contraction lemmas to be plugged in, possibly 
to yield improved error expressions. 

We have seen that a fixed-point free kernel is a necessary condition for achieving any positive rate. We have also demonstrated how fixed points can sometimes be eliminated by considering an equivalent channel, or a corresponding $\mu$-variant scheme. But can this binary observation be refined? From the error probability analysis of Section VI it roughly seems that the "closer" the kernel is to having a fixed point, the worse the error performance should be. It would be interesting to quantify this observation, and to characterize the best $\mu$-variant scheme for a given input/channel pair, in terms of minimizing the error probability.

We have derived the rate penalty incurred in a channel model mismatch setting, where a posterior matching 
scheme devised according to one channel model (and input distribution) is used over a different channel. However, 
the presence of feedback allows for an adaptive transmission scheme to be used in order to possibly reduce or even 
eliminate this penalty. When the channel is known to belong to some parametric family, there exist universal feedback transmission schemes that can achieve the capacity of the realized channel if the family is not too rich [25], and sometimes even attain the optimal error exponent [26]. However, these results involve random coding arguments, and so the associated schemes are neither explicit nor simple. It would therefore be interesting to examine whether an adaptive posterior matching scheme, in which the transmitter modifies its strategy online based on channel estimation, can be proven universal for some families of memoryless channels. It seems plausible that if the family is not too rich (e.g., in the sense of [27]) then the posterior will have a significant peak only when "close enough" to the true channel, and be flat otherwise. Another related avenue of future research is the universal communication problem in an individual/adversarial setting with feedback. For discrete alphabets, it was already demonstrated that the empirical capacity relative to a modulo-additive memoryless model can be achieved using a randomized sequential transmission strategy that builds on the Horstein scheme [28]. It remains to be explored whether this result can be extended to general alphabets by building on the posterior matching scheme, where the empirical capacity is defined relative to some parametric family of channels.

An extension of the suggested scheme to channels with memory is certainly called for. However, the posterior matching principle needs to be modified to take the channel's memory into account, since it is clear that a transmission independent of previous observations is not always the best option in this case. In hindsight, this part of the principle could have been phrased differently: The transmission functions should be selected so that the input sequence has the correct marginal distribution, and the output sequence has the correct joint distribution. In the memoryless case, this is just to say that $X_n \sim P_X$, and $Y^n$ is i.i.d. with the marginal $P_Y$ induced by $(P_X, P_{Y|X})$, which coincides with the original principle. However, when the channel has memory the revised principle seems to lead to the correct generalization. For instance, consider a setting where the channel is Markovian of some order $d$, and the "designed" input distribution is selected to be Markovian of order $d$ as well.* According to the revised principle, the input to the channel should be generated in such a way that any $d$ consecutive input/output pairs have the correct (designed) distribution,** and the joint output distribution is the one induced by the designed input distribution and the channel, so the receiver cannot "tell the difference". To emulate such a behavior, a $(d+1)$-order (or higher) kernel is required, since any lower order will result in some deterministic dependence between any $d$ consecutive pairs. This also implies that a $(d+1)$-dimensional message point is generally required in order to provide the necessary degrees of freedom in terms of randomness. It can be verified that whenever such a procedure is feasible, then under some mild regularity conditions the posterior p.d.f. at the message point is $\approx 2^{I(X^n \to Y^n)}$, where $I(X^n \to Y^n)$ is the directed information pertaining to the designed input distribution and the channel [31]. This is encouraging, since for channels with feedback the directed information usually plays the same role as mutual information does for channels without feedback [31], [32], [33], [34], [35]. Note also that the randomness degrees of freedom argument for a multi-dimensional message point provides a complementary viewpoint on the more analytic argument as to why the additional dimensions are required in order to attain the capacity of an auto-regressive Gaussian channel via a generalized Schalkwijk-Kailath scheme [36]. It is expected that a scheme satisfying the revised principle and its analysis should follow through via a similar approach to that appearing in this paper.

* By that we mean that $Y_n - X_{n-d}^{n} Y_{n-d}^{n-1} - X^{n-d-1} Y^{n-d-1}$ and $X_n - X_{n-d}^{n-1} Y_{n-d}^{n-1} - X^{n-d-1} Y^{n-d-1}$ are Markov chains.

** We interpret "marginal" here as pertaining to the degrees of freedom suggested by the designed input distribution.
A Main Proofs

Proof of Lemma UTJ] For the first claim, let us find the c.d.f. of F^^{Q): 

^{FxHO) <^)= P(inf{^ : Fx{z) > &} < x) T{Fxix) > 9) = Fx{x) 

where (a) holds since a c.d.f. is nondecreasing and continuous from the right, and so the result follows. For the 
second claim, define $ = Fx (X) — Q ■ Px {X) and let G (0, 1) be such that there exists xo € supp(X) for which 
Fx(xo) =<^. Then 

F^{(t>) > P{FxiX) < Fxixo)) = P(X < xo) = Fxixo) = </> 

and on the other hand 

F^i(b) < T{FxiX) - Px{X) < Fxixo)) = P(X < xo) = Fx{xo) = </> 
hence Fg,{(j)) = <f>. If such an xq does not exists then there must exist a jump point xi such that 

Fxixi) - Px{xi) <^< Fx{xi) ^ 01 

and so 

F^{4>) = F^{4>i) - = 01 - P(X = xi , e • Px{xi) < 01 - 0) = 01 - Px{xi) ■ ^-r^^ = 

\X\ j 

For a proper X there are no mass points hence the simpler result follows immediately. □ 

Proof of Lemma II.3. Assume we are given a transmission scheme $g_n$ and a decoding rule $\Delta_n$ which are known to achieve a rate $R_0$. For simplicity, we assume the decoding rule is fixed rate (i.e., $|\Delta_n(y^n)| = 2^{-nR_0}$ for all $y^n$), since any variable rate rule can be easily mapped into a fixed rate rule that achieves the same rate. It is easy to see that in order to prove the above translates into achievability of some rate $R < R_0$ in the standard framework, it is enough to show we can find a sequence $\Gamma_n = \{\theta_{j,n} \in (0, 1)\}_{j=1}^{\lfloor 2^{nR}\rfloor}$ of message point sets, such that $\theta_{j+1,n} - \theta_{j,n} > 2^{-nR_0}$ for any $1 \le j < \lfloor 2^{nR}\rfloor$, and such that we have uniform achievability over $\Gamma_n$, i.e.,

$$\lim_{n\to\infty}\max_{\theta\in\Gamma_n} P\left(\theta \notin \Delta_n(Y^n)\,|\,\Theta_0 = \theta\right) = 0$$

We now show how $\Gamma_n$ can be constructed for any $R < R_0$. Let $p_e(n)$ be the (average) error probability associated with our scheme and the fixed rate $R_0$ decoding rule. Define

$$A_n = \left\{\theta \in (0, 1) : P\left(\Theta_0 \notin \Delta_n(Y^n)\,|\,\Theta_0 = \theta\right) > \sqrt{p_e(n)}\right\}$$

and write

$$p_e(n) = \int_0^1 P\left(\Theta_0 \notin \Delta_n(Y^n)\,|\,\Theta_0 = \theta\right)d\theta \ge \sqrt{p_e(n)}\int_0^1 \mathbb{1}_{A_n}(\theta)\,d\theta$$

and so we have that $\int_0^1 \mathbb{1}_{A_n}(\theta)\,d\theta \le \sqrt{p_e(n)}$. It is now easy to see that if we want to select $\Gamma_n$ such that $\Gamma_n \cap A_n = \emptyset$, and also $\theta_{j+1,n} - \theta_{j,n} > 2^{-nR_0}$, then a sufficient condition is that $|\Gamma_n|^{-1}\left(1 - \sqrt{p_e(n)} - \tau_n\right) > 2^{-nR_0}$ for some positive $\tau_n \to 0$. This condition can be written as

$$\frac{1}{n}\log|\Gamma_n| < R_0 + \frac{1}{n}\log\left(1 - \sqrt{p_e(n)} - \tau_n\right) = R_0 + o(1)$$

At the same time, we also have by definition

$$\lim_{n\to\infty}\max_{\theta\in\Gamma_n} P\left(\theta \notin \Delta_n(Y^n)\,|\,\Theta_0 = \theta\right) \le \lim_{n\to\infty}\sqrt{p_e(n)} = 0$$

and the proof is concluded. □

Proof of Lemma II.8. Since $\xi$ is $\cap$-convex over $[0,1]$, it has a unique maximal value attained at some (not necessarily unique) point $x^*$. Moreover, convexity implies $\xi$ is continuous over $(0,1)$, and since it is nonnegative and upper bounded by $\xi(x) < x$, it is also continuous at $x = 0$ and $\xi(0) = 0$. Now, define the sequence $s_{n+1} \triangleq \xi(s_n)$ with $s_1 \triangleq x^*$. Since $\xi(x) < x$ the sequence $s_n$ is monotonically decreasing, and since $\xi$ is nonnegative it is also bounded from below. Therefore, $s_n$ converges to a limit $s_\infty \in [0,1)$, and we can write

$$\lim_{n\to\infty} s_n = s_\infty, \qquad \lim_{n\to\infty} \xi(s_n) = \lim_{n\to\infty} s_{n+1} = s_\infty$$

Since $\xi$ is continuous over $[0,1)$ the above implies that $\xi(s_\infty) = s_\infty$, i.e., $s_\infty$ is a fixed point of $\xi$. Thus, we either have $\xi \equiv 0$ in which case $s_\infty = 0$, or $\xi \not\equiv 0$ in which case the only fixed point for $\xi$ is zero and so again $s_\infty = 0$. We now note that $\xi(x) \le \xi(x^*) < x^*$ for any $x \in [0,1]$, and also that $\xi$ is nondecreasing over $[0, x^*]$ and hence so is $\xi^{(n)}$. We therefore have

$$\lim_{n\to\infty} r(n) = \lim_{n\to\infty} \sup_{x} \xi^{(n)}(x) \le \lim_{n\to\infty} \xi^{(n-1)}(x^*) = \lim_{n\to\infty} s_n = 0$$

□
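The argument above can be checked numerically on a toy contraction. The following is only a sketch under assumptions that are not part of the text: the specific function $\xi(x) = x(1-x)$ (nonnegative, $\cap$-convex, with $\xi(x) < x$ on $(0,1]$ and maximizer $x^* = 1/2$), the grid resolution, and all helper names are chosen here purely for illustration.

```python
def xi(x):
    """Hypothetical contraction: nonnegative, concave, xi(x) < x on (0, 1]."""
    return x * (1.0 - x)


def iterate(f, x, n):
    """Apply f to x exactly n times (n = 0 returns x unchanged)."""
    for _ in range(n):
        x = f(x)
    return x


def decay_profile(f, n, grid_size=2001):
    """Grid approximation of r(n) = sup_x f^(n)(x) over [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(iterate(f, x, n) for x in grid)


x_star = 0.5  # maximizer of xi on [0, 1]
profile = [decay_profile(xi, n) for n in range(1, 21)]
# Bound used in the proof: r(n) <= xi^(n-1)(x*).
bounds = [iterate(xi, x_star, n - 1) for n in range(1, 21)]
```

In this example the profile decreases with $n$ and stays below the bound $\xi^{(n-1)}(x^*)$, in line with the last display of the proof; the decay here is slow (roughly like $1/n$), which is consistent with the lemma claiming only convergence to zero.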



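Proof of Lemma II.9. For any $\varepsilon > 0$,

$$P(\psi(S_n(s)) > \varepsilon) \overset{(a)}{\le} \varepsilon^{-1} E[\psi(S_n(s))] = \varepsilon^{-1} E\left(E[\psi(S_n(s)) \mid Y^{n-1}]\right) = \varepsilon^{-1} E\left(E[\psi(\omega_{Y_n} \circ S_{n-1}(s)) \mid Y^{n-1}]\right)$$

$$\overset{(b)}{\le} \varepsilon^{-1} E\left[\xi(\psi(S_{n-1}(s)))\right] \overset{(c)}{\le} \varepsilon^{-1} \xi\left(E[\psi(S_{n-1}(s))]\right) \overset{(d)}{\le} \varepsilon^{-1} \xi^{(n)}(\psi(s)) \overset{(e)}{\le} \varepsilon^{-1} r(n)$$

Markov's inequality was used in (a), the contraction relation in (b) and Jensen's inequality in (c). Inequality (d) is a recursive application of the preceding transitions, and the definition of the decay profile was used in (e). □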

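Proof of Lemma II.10. For any $\varepsilon > 0$,

$$P(|\tilde{S}_n(s) - \tilde{S}_n(t)| > \varepsilon) = P(|\tilde{S}_n(s) - \tilde{S}_n(t)|^q > \varepsilon^q) \overset{(a)}{\le} \varepsilon^{-q} E|\tilde{S}_n(s) - \tilde{S}_n(t)|^q = \varepsilon^{-q} E\left(E\left(|\tilde{S}_n(s) - \tilde{S}_n(t)|^q \mid Y_2^n\right)\right)$$

$$\overset{(b)}{\le} \varepsilon^{-q} r \cdot E\left(|\omega_{Y_2} \circ \cdots \circ \omega_{Y_n}(s) - \omega_{Y_2} \circ \cdots \circ \omega_{Y_n}(t)|^q\right) \overset{(c)}{\le} \cdots \le \varepsilon^{-q} r^n |s - t|^q$$

where in (a) we use Markov's inequality, in (b) we use the contraction, and (c) is a recursive application of the preceding transitions. □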

Proof of Theorem III.1. We prove by induction that for any $n \in \mathbb{N}$, $P_{\Theta_0|Y^n}(\cdot|y^n)$ is proper for $P_{Y^n}$-a.a. $y^n \in \mathcal{Y}^n$, and the rest of the proof remains the same. First, this property is satisfied for $n = 0$ since $P_{\Theta_0}$ is proper. Now assume the property holds for any $1 \le n \le k-1$. By our previous derivations, this implies that $X_n \sim P_X$ for any $1 \le n \le k$, and thus by the definition of an input/channel pair we have in particular $I(X_n; Y_n) = I(X;Y) < \infty$ for any such $n$. Now suppose the property does not hold for $n = k$. This implies there exists a measurable set $A \subset \mathcal{Y}^k$ with $P_{Y^k}(A) > 0$ so that $P_{\Theta_0|Y^k}(\cdot|y^k) \not\ll P_{\Theta_0}$ for any $y^k \in A$. Therefore, it must be that $I(\Theta_0; Y^k) = \infty$. However, standard manipulations using the fact that the channel is memoryless result in $I(\Theta_0; Y^k) \le \sum_{n=1}^{k} I(X_n; Y_n) < \infty$, in contradiction. □
Proof of Lemma III.1, claim (iii). Since $\Theta \sim U$, it is enough to show that $P_{\Phi|\Theta}(\cdot|\theta)$ is proper for $U$-a.a. $\theta \in (0,1)$. Define the discrete part of the output support to be $\mathcal{Y}_D = \{y \in \mathrm{supp}(Y) : P_Y(y) > 0\}$, which is a countable set.

Define also the set $\mathfrak{D}_D = \mathrm{supp}(\Phi \mid Y \in \mathcal{Y}_D)$ which is a countable union of disjoint intervals inside the unit interval, corresponding to the "jump spans" introduced by $F_Y$ over $\mathcal{Y}_D$. Furthermore, for any $x \in \mathrm{supp}(X)$ define $\mathcal{Y}_{D,x}$ to be the set of mass points for $P_{Y|X}(\cdot|x)$. Since $I(X;Y) < \infty$, then it must be that $P_{Y|X}(\cdot|x) \ll P_Y$ for $P_X$-a.a. $x \in \mathrm{supp}(X)$. Therefore, there exists a set $A \subset \mathrm{supp}(X)$ of full measure $P_X(A) = 1$, so that $\mathcal{Y}_{D,x} \subseteq \mathcal{Y}_D$ for any $x \in A$. Therefore, for any $x \in A$, $P_{Y|X}(\cdot|x)$ restricted to $\mathrm{supp}(Y) \setminus \mathcal{Y}_D$ has a proper p.d.f., which implies that $P_{\Phi|X}(\cdot|x)$ restricted to $(0,1) \setminus \mathfrak{D}_D$ has a proper p.d.f. as well, since $\Phi$ is obtained from $Y$ by applying a continuous and bounded function. $P_{\Phi|X}(\cdot|x)$ restricted to any one of the countable number of intervals composing $\mathfrak{D}_D$ is uniform, hence admits a proper p.d.f. as well. We therefore conclude that $P_{\Phi|X}(\cdot|x)$ is proper for any $x \in A$. To conclude, define the set $B = \{\theta \in (0,1) : F_X^{-1}(\theta) \in A\}$, which by Lemma II.1 is of full measure $U(B) = 1$, and from the discussion above $P_{\Phi|\Theta}(\cdot|\theta)$ is proper for any $\theta \in B$. □

Proof of Lemma IV.4, claim (i). Suppose there exists some $y_0 \in \mathcal{Y}$ so that $P_Y(y_0) > 0$ and $P_X \prec_d P_{X|Y}(\cdot|y_0)$. Define the set $A_0 \triangleq \{\phi \in (0,1) : F_Y^{-1}(\phi) = y_0\}$. For any $x \in \mathcal{X}$ and $\phi \in A_0$, the normalized posterior matching kernel evaluated at $\theta = F_X(x)$ satisfies

$$F_{\Theta|\Phi}(F_X(x)|\phi) = F_{X|Y}(x|y_0) \ge F_X(x)$$

where the last inequality is due to the dominance assumption above, and is strict for $x \in \{0, \ldots, |\mathcal{X}| - 2\}$. Moreover, the normalized posterior matching kernel evaluated in between this finite set of points is simply a linear interpolation. Thus, for any $\theta \in (0,1)$ and any $\phi \in A_0$ we have $F_{\Theta|\Phi}(\theta|\phi) > \theta$, and so

$$P(F_{\Theta|\Phi}(\theta|\Phi) = \theta) \le 1 - P_\Phi(A_0) = 1 - P_Y(y_0) < 1$$
which implies the fixed-point free property (A3). The case where $P_{X|Y}(\cdot|y_0) \prec_d P_X$ follows by symmetry. The case where $P_{X|Y}(\cdot|y_0) \prec_d P_{X|Y}(\cdot|y_1)$ is trivial. □

Proof of Lemma IV.4, claim (iii). We find it simpler here to consider the normalized input but the original output $Y$, namely to prove an equivalent claim stating that the invariant distribution $P_{\Theta Y}$ for the Markov chain $(\Theta_n, Y_n)$ is ergodic. To that end, we show that if $S \subset (0,1) \times \mathcal{Y}$ is an invariant set, then $P_{\Theta Y}(S) \in \{0, 1\}$. Let us write $S$ as a disjoint union:

$$S = \bigcup_{y \in \mathcal{Y}} A_y \times \{y\}, \qquad A_y \subset (0,1)$$

The posterior matching kernel deterministically maps a pair $(\theta, y)$ to the next input $F_{\Theta|Y}(\theta|y)$, and then the corresponding output is determined via $P_{Y|\Theta}$ given that input. Since by (B1) all transition probabilities are nonzero, then each possible output in $\mathcal{Y}$ is seen with a nonzero probability given any input. Thus, denoting the stochastic kernel of the Markov chain by $\mathcal{P}$, we have that $\mathcal{P}(\cdot|(\theta, y))$ has support on the discrete set $\{F_{\Theta|Y}(\theta|y)\} \times \mathcal{Y}$ for any $(\theta, y) \in S$. Since $S$ is an invariant set, this implies that

$$S' \triangleq \bigcup_{y \in \mathcal{Y}} F_{\Theta|Y}(A_y|y) \times \mathcal{Y} \subseteq S$$

where by $F_{\Theta|Y}(A_y|y)$ we mean the image set of $A_y$ under $F_{\Theta|Y}(\cdot|y)$. This in turn implies that

$$\bigcup_{y \in \mathcal{Y}} F_{\Theta|Y}(A_y|y) \subseteq \bigcap_{y \in \mathcal{Y}} A_y \triangleq A \tag{57}$$

Now, defining

$$\tilde{S} \triangleq A \times \mathcal{Y}$$

we have that $S' \subseteq \tilde{S} \subseteq S$, and hence $\tilde{S}$ is also an invariant set. Going through the same derivations as for $S$, the invariance of $\tilde{S}$ implies that

$$\bigcup_{y \in \mathcal{Y}} F_{\Theta|Y}(A|y) \subseteq A \tag{58}$$

and hence

$$U(A) \ge \max_{y \in \mathcal{Y}} U(F_{\Theta|Y}(A|y)) \ge \sum_{y \in \mathcal{Y}} U(F_{\Theta|Y}(A|y)) P_Y(y) = \sum_{y \in \mathcal{Y}} P_{\Theta|Y}(A|y) P_Y(y) = P_\Theta(A) = U(A)$$

To avoid contradiction, it must be that $U(F_{\Theta|Y}(A|y)) = U(A)$ for all $y \in \mathcal{Y}$, and together with (58) it immediately follows that for all $y \in \mathcal{Y}$

$$F_{\Theta|Y}(A|y) = A \setminus N_y, \qquad U(N_y) = 0 \tag{59}$$

Namely, for any output value, the set $A$ remains the same after applying the posterior matching kernel, up to a $U$-null set.

Let us now prove the implication

$$U(A) \in \{0, 1\} \;\Rightarrow\; P_{\Theta Y}(S) \in \{0, 1\} \tag{60}$$

To that end, we show that $0 < P_{\Theta Y}(S) < 1$ implies $0 < U(A) < 1$. The upper bound follows from $U(A) = P_\Theta(A) = P_{\Theta Y}(\tilde{S}) \le P_{\Theta Y}(S) < 1$. For the lower bound, we note that $P_{\Theta Y}(S) > 0$ implies there exists at least one $y_0 \in \mathcal{Y}$ such that $U(A_{y_0}) > 0$. Recall that for a DMC, the normalized posterior matching kernel for any fixed output $y$ is a quasi-affine function with slopes given by $\{P_{Y|X}(y|x)/P_Y(y)\}_{x \in \mathcal{X}}$. Since by (B1) all the transition probabilities are nonzero, these slopes are all positive, and denote their minimal value by $\alpha > 0$. Therefore, it must be that $U(F_{\Theta|Y}(A_{y_0}|y_0)) \ge \alpha\, U(A_{y_0}) > 0$, which by (57) implies $U(A) > 0$.

After having established (60), we proceed to show that $U(A) \in \{0, 1\}$ which will verify Property (A2). It is easily observed that if $A$ is an interval, (59) holds if and only if the endpoints of the interval are both either fixed points of the kernel or endpoints of $(0,1)$. For $A$ a finite disjoint union of intervals, (59) holds if and only if all non-shared endpoints are both either fixed points of the kernel or endpoints of $(0,1)$. Hence for such $A$, since we assumed the kernel does not have any fixed points, (59) holds if and only if $U(A) \in \{0, 1\}$.

Let us now extend this argument to any A e *B. Under (B(3]l, there exist two output symbols ya,yi G y such 

that 

Pi 



where 



= ( p^(o) ) ' * e {0, 1} 



Define the set 

B = {& G (0, 1) : 3no, ni G IN, 6 = 2"o''°+"i'^i } 
Lemma A.l. B is dense in (0, 1). 

Proof. Without loss of generality, we assume /3o < < We prove equivalently that the set log_B is dense in 
(-00, 0). Let b G (-0O, 0). Define 



bn = n(3o + 



Pi 



n G 



It is easy to see that {^n}^„' C log B, if n' is taken to be large enough. Let {x} ^ a; — [xj be the fractional part 
of X. Write: 

^ b-bn ( b f fio" 



5i '"^A 



Since $\beta_0/\beta_1 \notin \mathbb{Q}$, $r_n$ can be thought of as an irrational rotation on the unit circle, hence is dense in $(0,1)$ [37]. In particular, this implies that $r_n$ has a subsequence $r_{k_n} \to 0$, hence $b_{k_n} \to b$. □
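The density claim can be illustrated with a brute-force search. This is a sketch only, under assumptions that are not taken from the text: a BSC with crossover probability 0.1 and uniform input, for which the kernel slopes over the leftmost interval are $2 \cdot 0.1$ and $2 \cdot 0.9$, so the two exponents are $\log_2 0.2 < 0$ and $\log_2 1.8 > 0$ (their ratio is irrational, since $5^a = 9^b$ has no nonzero integer solutions). The function name and the search bound are hypothetical.

```python
import math


def best_approximation(b, beta0, beta1, max_n0):
    """Search naturals n0, n1 minimising |n0*beta0 + n1*beta1 - b|.

    Assumes beta0 < 0 < beta1 and b < 0. Returns (error, n0, n1).
    """
    best = None
    for n0 in range(max_n0 + 1):
        # Real-valued n1 that would hit b exactly, rounded to a natural number.
        n1 = max(0, round((b - n0 * beta0) / beta1))
        err = abs(n0 * beta0 + n1 * beta1 - b)
        if best is None or err < best[0]:
            best = (err, n0, n1)
    return best


eps = 0.1                          # assumed BSC crossover probability
beta0 = math.log2(2 * eps)         # negative exponent
beta1 = math.log2(2 * (1 - eps))   # positive exponent
target = math.log2(0.3)            # log of an arbitrary point in (0, 1)

coarse = best_approximation(target, beta0, beta1, max_n0=50)
fine = best_approximation(target, beta0, beta1, max_n0=2000)
```

Enlarging the search range can only shrink the approximation error, which is the finite-search counterpart of the subsequence $b_{k_n} \to b$ in the proof.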

For $\theta \in (0,1)$, let $A(\theta) = A \cap (0, \theta)$. For brevity, let $p = P_X(0)$. Define $A_{n_0,n_1}$ to be the set obtained starting from $A(p)$ and applying $F_{\Theta|Y}(\cdot|y_0)$ $n_0$ times, and then applying $F_{\Theta|Y}(\cdot|y_1)$ $n_1$ times. $F_{\Theta|Y}(\cdot|y_i)$ is linear over $(0,p)$ with a slope $2^{\beta_i}$, hence assuming that $2^{n_0\beta_0 + n_1\beta_1} < 1$, we have

$$U(A_{n_0,n_1}) = 2^{n_0\beta_0 + n_1\beta_1} \cdot U(A(p)) \tag{61}$$






On the other hand, (59) together with the aforementioned linearity imply that $A_{n_0,n_1}$ and $A(p \cdot 2^{n_0\beta_0 + n_1\beta_1})$ are equal up to a $U$-null set. Combining this with (61) and Lemma A.1 we find that for any $\theta \in (0,p)$

$$U(A(\theta)) = \theta p^{-1} U(A(p))$$

We note that $U(A(\theta))$ is the indefinite Lebesgue integral of $1_{A(p)}(\theta)$. Invoking the Lebesgue differentiation Theorem [20], the derivative $\frac{dU(A(\theta))}{d\theta} = p^{-1} U(A(p))$ must be equal to $1_{A(p)}(\theta)$ for a.a. $\theta \in (0,p)$, which implies $p^{-1} U(A(p)) \in \{0, 1\}$. Hence $A(p)$ is either of full measure or a null set.

Let us now show that this implies the same for $A = A(1)$. Define the function

$$F(\theta) = \max_{y \in \mathcal{Y}} F_{\Theta|Y}(\theta|y)$$

Let us establish some properties of $F$.

(a) $F$ is Lipschitz, monotonically increasing, and maps $(0,1)$ onto $(0,1)$: Trivial.

(b) $F^{(n)}(\theta) \to 1$ monotonically as $n \to \infty$ for any $\theta \in (0,1)$: Observe that

$$E\left(F_{\Theta|Y}(\theta|Y)\right) = E\,P(\Theta \le \theta \mid Y) = P(\Theta \le \theta) = \theta,$$

Hence $F(\theta) \ge \theta$ with equality if and only if $\theta$ is a fixed point, which contradicts property (A3). Thus it must hold that $F(\theta) > \theta$ for any $\theta \in (0,1)$, hence $F^{(n)}(\theta)$ is increasing with $n$. $F \le 1$ and therefore a limit exists and is at most 1. $F$ is continuous, hence the limit cannot be smaller than 1 as this will violate $F(\theta) > \theta$.

(c) $F(A) = A$ up to a $U$-null set: it is easily observed that

$$\bigcap_{y \in \mathcal{Y}} F_{\Theta|Y}(A|y) \subseteq F(A) \subseteq \bigcup_{y \in \mathcal{Y}} F_{\Theta|Y}(A|y)$$

The property follows by applying (59).

Combining (a) and (c) it follows that for any $n \ge 1$, $F^{(n)}(A(p)) = A(F^{(n)}(p))$ up to a $U$-null set. Furthermore, since $A(p)$ is either of full measure or null, then property (a) implies the same for $F^{(n)}(A(p))$, and so either $U(A(F^{(n)}(p))) = 0$ for all $n$, or $U(A(F^{(n)}(p))) = F^{(n)}(p)$. Using (b), we get:

$$U(A) = U\left(\bigcup_{n} A(F^{(n)}(p))\right) \in \left\{0, \lim_{n\to\infty} F^{(n)}(p)\right\} = \{0, 1\}$$

Hence (A2) holds.

Remark A.1. The proof only requires an irrational ratio to be found for $x = 0$ (or similarly, for $x = |\mathcal{X}| - 1$), hence a weaker version of property (B3) suffices. It is unclear if even this weaker property is required for ergodicity to hold. The proof fails whenever the leftmost interval $(0, P_X(0))$ cannot be densely covered by a repeated application of the posterior matching kernel (starting from the right endpoint), without ever leaving the interval. This argument leans only on the linearity of the kernel within that interval, and does not use the entire non-linear structure of the kernel. It therefore seems plausible that condition (B3) could be further weakened, or perhaps even completely removed.

□ 
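The properties of the envelope $F(\theta) = \max_y F_{\Theta|Y}(\theta|y)$ used in the proof can be checked on a concrete case. The sketch below assumes a BSC with crossover probability `eps` and a uniform binary input (the Horstein setting mentioned in the abstract); the closed-form piecewise-linear kernel, the value `eps = 0.1`, and all function names are illustrative assumptions rather than definitions from the text.

```python
def posterior_cdf(theta, y, eps):
    """Normalized kernel F_{Theta|Y}(theta | y) for a BSC(eps), uniform input.

    Piecewise linear: slope 2*P(X=0|Y=y) on (0, 1/2), 2*P(X=1|Y=y) on (1/2, 1).
    """
    p0 = 1.0 - eps if y == 0 else eps  # P(X = 0 | Y = y)
    if theta <= 0.5:
        return 2.0 * p0 * theta
    return p0 + 2.0 * (1.0 - p0) * (theta - 0.5)


def envelope(theta, eps):
    """F(theta) = max over outputs y of F_{Theta|Y}(theta | y)."""
    return max(posterior_cdf(theta, y, eps) for y in (0, 1))


eps = 0.1
grid = [i / 100 for i in range(1, 100)]

# E[F_{Theta|Y}(theta|Y)] with P_Y uniform (true for a BSC with uniform input).
means = [0.5 * posterior_cdf(t, 0, eps) + 0.5 * posterior_cdf(t, 1, eps)
         for t in grid]

# Iterate the envelope from a point close to zero.
theta = 0.01
for _ in range(200):
    theta = envelope(theta, eps)
```

In this example the averaged kernel returns $\theta$ itself, the envelope lies strictly above the identity on $(0,1)$, and the iterates climb to 1, matching property (b).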

Proof of Lemma IV.4, claim (iv). (B1) trivially holds for any equivalent input/channel pair. Let us show there exists one satisfying (B2). To that end, the following Lemma is found useful.

Lemma A.2. Let $p^n, q^n$ be two distinct probability vectors. Then there exists a permutation operator $\sigma : \mathbb{R}^n \to \mathbb{R}^n$ such that $\sigma(q^n) \prec_d \sigma(p^n)$.

Proof. Let $\delta^n$ be the element-wise difference of $p^n$ and $q^n$, i.e., $\delta_k = p_k - q_k$. Define $\sigma$ to be a permutation operator such that $\sigma(\delta^n)$ is in descending order. Then since $p^n \ne q^n$ and $\sum_i \delta_i = 0$ we have that any partial sum of $\sigma(\delta^n)$ is positive, i.e., $\sum_{i=1}^{k} \{\sigma(\delta^n)\}_i > 0$ for any $k < n$, which implies the result. □

Now, since $I(X;Y) > 0$ there must exist some $y_0 \in \mathcal{Y}$ so that $P_{X|Y}(\cdot|y_0) \ne P_X$. Viewing distributions as probability vectors, then by Lemma A.2 above there exists a permutation operator $\sigma$ such that $\sigma(P_X) \prec_d \sigma(P_{X|Y}(\cdot|y_0))$. Thus, applying $\sigma$ to the input results in an equivalent input/channel pair for which (B2) holds. □
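The construction in Lemma A.2 is easy to carry out explicitly. The following sketch sorts the indices by the difference $p_k - q_k$ in descending order and checks that the permuted cumulative sums of $p^n$ dominate those of $q^n$; the two example vectors and the function names are hypothetical, and exact rational arithmetic is used only to avoid rounding noise in the comparison.

```python
from fractions import Fraction
from itertools import accumulate


def dominance_permutation(p, q):
    """Indices ordered so that the differences p[k] - q[k] are descending."""
    delta = [pk - qk for pk, qk in zip(p, q)]
    return sorted(range(len(delta)), key=lambda k: delta[k], reverse=True)


def permuted_cdf(vec, order):
    """Cumulative sums of vec after reordering its entries by `order`."""
    return list(accumulate(vec[k] for k in order))


# Two distinct probability vectors (illustrative values).
p = [Fraction(2, 10), Fraction(5, 10), Fraction(3, 10)]
q = [Fraction(4, 10), Fraction(1, 10), Fraction(5, 10)]

order = dominance_permutation(p, q)
# Partial sums of the permuted difference vector.
gaps = [a - b for a, b in zip(permuted_cdf(p, order), permuted_cdf(q, order))]
```

Every partial sum except the last is strictly positive, and the last one is zero because both vectors sum to one.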

Proof of Lemma IV.4, claim (vi). Let $(\mathcal{P}(\mathcal{X}), d_{TV})$ be the space of probability distributions over the alphabet $\mathcal{X}$, equipped with the total variation metric. For a fixed channel $P_{Y|X}$, the set $S$ of input distributions not satisfying property (B3) is clearly of countable cardinality. Since any non-singleton open ball centered at any point in $\mathcal{P}(\mathcal{X})$ is of uncountable cardinality, then $\mathcal{P}(\mathcal{X}) \setminus S$ must be dense in $(\mathcal{P}(\mathcal{X}), d_{TV})$, and the claim follows.

□ 

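Proof of Lemma V.1. Let $\lambda : [0,1] \to [0,1]$ be any surjective, strictly $\cap$-convex function symmetric about $\frac{1}{2}$. This implies in particular that $\lambda(\cdot)$ is continuous, its restriction to $[0, \frac{1}{2}]$ is injective, and $\lambda(0) = \lambda(1) = 0$, $\lambda(\frac{1}{2}) = 1$. Let $\lambda^{-1} : [0,1] \to [0, \frac{1}{2}]$ be the inverse of $\lambda$ restricted to the $[0, \frac{1}{2}]$ branch. Let $\psi_\lambda$ be the corresponding length function over $\mathcal{F}_c$ as defined in (8). Define the function $\xi^* : [0,1] \to [0,1]$ as follows:

$$\xi^*(\theta) \triangleq \max\left\{E\,\lambda\left(F_{\Theta|\Phi}(\lambda^{-1}(\theta)|\Phi)\right),\; E\,\lambda\left(F_{\Theta|\Phi}(1 - \lambda^{-1}(\theta)|\Phi)\right)\right\}$$

We now establish the following two properties:

(a) $\xi^*(\cdot)$ is continuous over $[0,1]$: Fix any $\theta' \in [0,1]$, and let $\{\theta_n\}_{n=1}^{\infty}$ be a sequence in $[0,1]$ such that $\theta_n \to \theta'$. Define $q(\theta, \phi) = \lambda\left(F_{\Theta|\Phi}(\lambda^{-1}(\theta)|\phi)\right)$, and $q_n(\phi) = q(\theta_n, \phi)$. By Lemma III.1 claim (iv), $F_{\Theta|\Phi}(\theta|\phi)$ is continuous in $\theta$ for $P_\Phi$-a.a. $\phi \in (0,1)$. Since $\lambda(\cdot), \lambda^{-1}(\cdot)$ are continuous, we have that $q(\theta, \phi)$ is continuous in $\theta$ for $P_\Phi$-a.a. $\phi \in (0,1)$, and therefore $q_n(\phi) \to q(\theta', \phi)$ for $P_\Phi$-a.a. $\phi \in (0,1)$. Furthermore, $|q_n(\phi)| \le 1$. Thus, invoking the bounded convergence Theorem [20] we get

$$\lim_{n\to\infty} E(q_n(\Phi)) = E(q(\theta', \Phi))$$

Reiterating for $q(\theta, \phi) = \lambda\left(F_{\Theta|\Phi}(1 - \lambda^{-1}(\theta)|\phi)\right)$, we conclude that $\xi^*(\theta_n) \to \xi^*(\theta')$.

(b) $0 \le \xi^*(\theta) < \theta$ for $\theta \in (0,1]$: The lower bound is trivial. For the upper bound, we note again that

$$E\left(F_{\Theta|\Phi}(\theta|\Phi)\right) = E\,P(\Theta \le \theta \mid \Phi) = P(\Theta \le \theta) = \theta,$$

and since by the fixed-point free property (A3) we also have $P(F_{\Theta|\Phi}(\theta|\Phi) = \theta) < 1$ for any $\theta \in (0,1)$, then $F_{\Theta|\Phi}(\theta|\Phi)$ is not a.s. constant. Combining that with the fact that $\lambda(\cdot)$ is strictly $\cap$-convex, a strict Jensen's inequality holds:

$$E\,\lambda\left(F_{\Theta|\Phi}(\lambda^{-1}(\theta)|\Phi)\right) < \lambda\left(E\left(F_{\Theta|\Phi}(\lambda^{-1}(\theta)|\Phi)\right)\right) = \lambda(\lambda^{-1}(\theta)) = \theta$$

Similarly, using the symmetry of $\lambda(\cdot)$,

$$E\,\lambda\left(F_{\Theta|\Phi}(1 - \lambda^{-1}(\theta)|\Phi)\right) < \lambda(1 - \lambda^{-1}(\theta)) = \lambda(\lambda^{-1}(\theta)) = \theta$$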

Now, define $\xi(\cdot)$ to be the upper convex envelope of $\xi^*(\cdot)$. Let us show that $\xi(\cdot)$ is a contraction. $\xi(\cdot)$ is trivially $\cap$-convex and nonnegative, hence it remains to prove that $\xi(\theta) < \theta$ for $\theta \in (0,1]$. Define the function

$$\delta(\theta) \triangleq \inf_{\phi \in [\theta, 1]} \{\phi - \xi^*(\phi)\}$$

Property (b) implies that $\delta(0) = 0$. Combining properties (a) and (b), we observe that $\phi - \xi^*(\phi)$ is continuous and positive over $[\theta, 1]$ for any fixed $\theta \in (0,1]$, hence attains a positive infimum over that interval. We conclude that $\delta(\theta)$ is continuous and monotonically nondecreasing over $[0,1]$, and positive over $(0,1]$. Fixing any $\theta' \in (0,1]$, we use the definition of the upper convex hull and the above properties of $\delta(\cdot)$ to write

$$\xi(\theta') = \sup\{\alpha \xi^*(\theta_0) + (1-\alpha)\xi^*(\theta_1)\} \le \sup\{\alpha(\theta_0 - \delta(\theta_0)) + (1-\alpha)(\theta_1 - \delta(\theta_1))\}$$

$$\le \theta' - \inf\{\alpha\delta(\theta_0) + (1-\alpha)\delta(\theta')\} \tag{62}$$

where the supremums and the infimum are taken over all $\{\theta_0, \theta_1, \alpha\}$ such that $0 \le \theta_0 \le \theta' \le \theta_1 \le 1$, and such that $\theta'$ is the convex combination $\theta' = \alpha\theta_0 + (1-\alpha)\theta_1$. Thus, since $\delta(\theta') > 0$, a necessary condition for $\xi(\theta') \ge \theta'$ is for the infimum in (62) to be attained as $\alpha \to 1$ and $\delta(\theta_0) \to 0$. By continuity and positivity, the latter implies $\theta_0 \to 0$. However, the convex combination for $\theta'$ can be maintained as $\alpha \to 1$ and $\theta_0 \to 0$ if and only if $\theta' = 0$, in contradiction. Hence $\xi(\theta') < \theta'$.

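To conclude the proof, we demonstrate that $\xi(\cdot)$ and $\psi_\lambda$ satisfy (36):

$$E\left(\psi_\lambda\left[F_{\Theta|\Phi}(\cdot|\Phi) \circ h\right]\right) = \int_0^1 E\,\lambda\left(F_{\Theta|\Phi}(h(\theta)|\Phi)\right) d\theta \overset{(a)}{\le} \int_0^1 \xi^*(\lambda(h(\theta)))\,d\theta \overset{(b)}{\le} \int_0^1 \xi(\lambda(h(\theta)))\,d\theta \overset{(c)}{\le} \xi\left(\int_0^1 \lambda(h(\theta))\,d\theta\right) = \xi(\psi_\lambda(h))$$

where (a) holds by the definition of $\xi^*$ and the symmetry of $\lambda(\cdot)$, (b) holds since $\xi \ge \xi^*$, and (c) holds by Jensen's inequality. □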
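Property (b) of the proof, $\xi^*(\theta) < \theta$, can be verified numerically in a simple case. The sketch below makes several assumptions that are not in the text: a BSC with crossover probability `eps` and uniform input (so that the kernel depends on $\Phi$ only through the binary output, which is equiprobable), and the particular choice $\lambda(x) = 4x(1-x)$, which is surjective onto $[0,1]$, strictly $\cap$-convex, and symmetric about $1/2$. All names are illustrative.

```python
import math


def posterior_cdf(theta, y, eps):
    """Normalized kernel F_{Theta|Y}(theta | y) for a BSC(eps), uniform input."""
    p0 = 1.0 - eps if y == 0 else eps  # P(X = 0 | Y = y)
    if theta <= 0.5:
        return 2.0 * p0 * theta
    return p0 + 2.0 * (1.0 - p0) * (theta - 0.5)


def lam(x):
    """Assumed lambda: strictly concave, symmetric about 1/2, lam(1/2) = 1."""
    return 4.0 * x * (1.0 - x)


def lam_inv(t):
    """Inverse of lam restricted to the [0, 1/2] branch."""
    return 0.5 * (1.0 - math.sqrt(1.0 - t))


def xi_star(t, eps):
    """max of E[lam(F(lam_inv(t)|Y))] and E[lam(F(1 - lam_inv(t)|Y))]."""
    def mean_lam(point):
        # Outputs 0 and 1 are equiprobable for a BSC with uniform input.
        return sum(0.5 * lam(posterior_cdf(point, y, eps)) for y in (0, 1))

    left = lam_inv(t)
    return max(mean_lam(left), mean_lam(1.0 - left))


grid = [i / 100 for i in range(1, 101)]
values = [xi_star(t, eps=0.1) for t in grid]
```

For `eps = 0.1` the values stay strictly below the identity on the whole grid. Setting `eps = 0.5` makes the kernel the identity map (every point is a fixed point), and the strict inequality degenerates to equality, which is the situation excluded by property (A3).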

Lemma A.3. Let {Px,Py\x) satisfy property {^^. Then for any a > 0, e > and 5 > Q, 



P max 

y l<m<?i 

where $r(n)$ is the decay profile of the contraction $\xi$ from Lemma V.1.
Proof. For any g G and any m, n e IN where m < n, define 

Then for any fixed m and g, {Gf„ „}^„j is an IFS over ^c- Let gM(6') = 6* be the uniform c.d.f., and define the 
following rv.'s: 

Lm,n = Tpx (G^'.n) L*^.n ^ SUp (G^m.n) 

where i/j^ is the associated length function from Lemma lVTI Clearly, L^.n < n- Furthermore, L'^ „ < L^_^_i „ 
for any m < n — 1. To see that, we note that L*j „ is a deterministic function of <i>J^j = ('J'm, ■ • ■ , ^n), hence there 
exists a sequence of functions {gk{9; such that ^JJ:^) € iJc for any sequence (j)"^ g (0, and 



Therefore, for any ly > Owe have 

P I max i,„,(i+a)n > ) < P ( max (i+„)„ > 1 = P (L;,(i+„)„ > 1 < i^"V(an) 

\l<m<n I \ l<m<n 'v / / y / j 



where we have used Lemmas II.9 and V.1 for the last inequality, noting that the former holds for any IFS initialization. The proof now follows that of Lemma V.2 with the proper minor modifications. □

Lemma A.4. Let {Px,Py\x) satisfy (/^J} and Then (H?]) holds, and for any rate R < I{X; Y) 



mill ( e, ) J- 1 = 



lim limsupP I n (bjc - "O;!'^ < mil 

lim limsupP ^ f] l+e^-^ - Ofc < min (e, ) }) " ^^^^ 

Furthermore, if $P_X$ is also the unique input distribution for $P_{Y|X}$ such that $I(X;Y) = C(P_{Y|X})$, then

$$\lim_{n\to\infty} n^{-1} \sum_{k=1}^{n} \eta(X_k) = E(\eta(X)) \quad \text{a.s.} \tag{64}$$

for any measurable $\eta : \mathcal{X} \to \mathbb{R}$ satisfying $E(|\eta(X)|) < \infty$.

Proof. Without the ergodicity property (A2), we cannot directly use the SLLN which was a key tool in deriving (44) and (48). Instead, we use the ergodic decomposition for Markov chains [14, Section 5.3] to write the invariant distribution $P_{\Theta\Phi}$ as a mixture of ergodic distributions. We then apply the SLLN to each ergodic component, and use the maximality property (A4) to control the behavior of the chain within each component. For clarity of exposition, we avoid some of the more subtle measure theoretic details for which the reader is referred to [14].

Let $\mathcal{P}$ denote the Markov stochastic kernel associated with the posterior matching scheme. The ergodic decomposition implies that there exists a r.v. $\Gamma$ taking values in $(0,1)$, such that $\Gamma = \chi(\Theta)$ for some measurable function $\chi : (0,1) \to (0,1)$, and $P_{\Theta\Phi|\Gamma}(\cdot|\gamma)$ is ergodic for $\mathcal{P}$, for $P_\Gamma$-a.a. $\gamma$. Let us first show that $\Phi$ and $\Gamma$ are statistically independent. For any $S \in \mathfrak{B}$, it is clear that $P_{\Theta\Phi|\Gamma}(\cdot|S)$ is an invariant distribution for $\mathcal{P}$, being a mixture of ergodic distributions. Hence the set $\chi^{-1}(S)$ must be invariant for the posterior matching kernel, i.e.,

$$F_{\Theta|\Phi}(\chi^{-1}(S)|\phi) = \chi^{-1}(S) \tag{65}$$

up to a $P_\Theta$-null set, for $P_\Phi$-a.a. $\phi$. Define $Z = F_{\Theta|\Phi}(\Theta|\Phi)$. For any $S, T \in \mathfrak{B}$:

$$P_{\Gamma\Phi}(S, T) = P_{\Theta\Phi}(\chi^{-1}(S), T) \overset{(a)}{=} P_{Z\Phi}(\chi^{-1}(S), T) \overset{(b)}{=} P_Z(\chi^{-1}(S)) \cdot P_\Phi(T) \overset{(c)}{=} P_\Theta(\chi^{-1}(S)) \cdot P_\Phi(T) = P_\Gamma(S) P_\Phi(T)$$

where (a) follows from (65) and the fact that $(Z, \Phi)$ a.s. determines $\Theta$, (b) holds since $Z$ is independent of $\Phi$, and (c) holds since $Z \sim P_\Theta$ (i.e., uniform).

We can now apply the SLLN (Lemma FlLTb to each ergodic component X^^(7). For Pp-a.a. 7 and Pe|r(-|7)-a.a. 
message points 6*0 € X^'^il) 

lim 1 log/e„|*.(^o|$") - E flog ^^^^^ | T = 7) 

Now, for any $\gamma$

$$I(\Theta; \Phi \mid \Gamma = \gamma) \le C(P_{\Phi|\Theta}) = C(P_{Y|X})$$

where the inequality holds by the definition of the unconstrained capacity and since $\Gamma - \Theta - \Phi$ is a Markov chain, and the equality holds since the normalized channel preserves the mutual information (Lemma III.1). Furthermore, using the independence of $\Phi$ and $\Gamma$ and the Markov relation above again, together with property (A4), leads to

$$I(\Theta; \Phi \mid \Gamma) = I(\Theta; \Phi) = C(P_{Y|X})$$

Combining the above we conclude that for $P_\Gamma$-a.a. $\gamma$

$$I(\Theta; \Phi \mid \Gamma = \gamma) = C(P_{Y|X}) \tag{67}$$

Substituting the above into (66) yields

$$\lim_{n\to\infty} \frac{1}{n} \log f_{\Theta_0|\Phi^n}(\theta_0|\Phi^n) = C(P_{Y|X}) \quad \text{a.s.}$$

for $P_\Theta$-a.a. $\theta_0$. This in turn implies (44).

Establishing (63) follows the same line of argument, proving a weaker version of (48). By the ergodic decomposition, for $P_\Gamma$-a.a. $\gamma$ and $P_{\Theta|\Gamma}(\cdot|\gamma)$-a.a. message points $\theta_0 \in \chi^{-1}(\gamma)$

1 " 

l™,-Elog"/l|e('J^'»|0fe) = lE(log-/||e(*|e) |r = 7) a.s. 



n—>oo 71 

k=l 



= C,i-f) (68) 



23i 



^The chain has at least one invariant distribution, and evolves over a locally compact state space (0, 1) , hence admits an ergodic decomposi- 
tion. 

-''Note that (67) does not hold in general if property (>i|4) is not satisfied, as there may be variations in the limiting values between ergodic 
components. 






The function Ce (7) satisfies 

E/:ar) = E (iog-/||e($|e)) = 



and since /||e < /$|e, then 

Ceh) < C{Py\x) 

for Pr-a.a. 7. Now since I{X] Y) = C{Py\x) under property (A|4]i, then 

^C{Py\x) (69) 
It is therefore clear that for small e values C^{'^) must be close to /~ for a set of high Pr probability. Precisely: 
A,,. ^ {7 G supp(r) : £,(7) > K - - /-)} 

Then 

/- = E/:,(r) < Pr{A,^,)C[PY\x) + (1 - Pr(Ae,.)) (4- - u-\C{Py\x) - O) 
Rearranging, we get 

Pr{A,.,) > (70) 
Combining ( |68] |, ( |69] l and dTOl i. we conclude that for any v > Q and any e > small enough, 

P [ lim i X^log7l|e(1>^|0fe) > 4" - — ) > (71) 



for $P_\Theta$-a.a. message points $\theta_0$, where $\delta(\varepsilon) \to 0$ as $\varepsilon \to 0$. The remainder of the proof follows that of Lemma V.3 with some minor adaptations.

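Finally, suppose $P_X$ is the unique capacity achieving input distribution for $P_{Y|X}$. For $P_\Gamma$-a.a. $\gamma$,

$$I(X; Y \mid \Gamma = \gamma) = I(\Theta; \Phi \mid \Gamma = \gamma) = C(P_{Y|X}) \tag{72}$$

Thus, since $\Gamma - X - Y$ is a Markov chain and from the uniqueness of $P_X$ as capacity achieving, it must be that $P_{X|\Gamma}(\cdot|\gamma) = P_X(\cdot)$ for $P_\Gamma$-a.a. $\gamma$. Applying the SLLN to each ergodic component, we find that for $P_\Gamma$-a.a. $\gamma$ and $P_{\Theta|\Gamma}(\cdot|\gamma)$-a.a. message points $\theta_0 \in \chi^{-1}(\gamma)$

$$\lim_{n\to\infty} n^{-1} \sum_{k=1}^{n} \eta(X_k) = \lim_{n\to\infty} n^{-1} \sum_{k=1}^{n} \eta(F_X^{-1}(\Theta_k)) = E\left(\eta(F_X^{-1}(\Theta)) \mid \Gamma = \gamma\right) \quad \text{a.s.}$$

$$= E(\eta(X) \mid \Gamma = \gamma) = E\,\eta(X)$$

establishing (64).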

Remark A.2. It is instructive to point out that the proof of the Lemma holds also when property (A3) is not satisfied, namely when the posterior matching kernel has fixed points. In that case, each ergodic component must lie strictly inside an invariant interval (i.e., an interval between adjacent fixed points), which results in a decoding ambiguity as the receiver cannot distinguish between the ergodic components. As discussed in Section VII-A, this exact phenomenon prevents any positive rate from being achieved, and generally requires using a posterior matching variant. The fact that capacity is nonetheless achieved under (A4) in the absence of fixed points even when the chain is not ergodic, suggests that in this case almost any ergodic component, in addition to being capacity achieving in the sense of (67), is also dense in $(0,1)$. The intuitive interpretation is that in that case any interval intersects with almost all of the ergodic components, hence the receiver, interested in decoding intervals, is "indifferent" to the specific component the chain lies in.

□ 
B Pointwise Achievability Proofs

Proof of Lemma |70] Property (v4|5]l implies in particular that Fx{x),FY{y) are continuous and bijective over 
siipp(X), siipp(Y) respectively, and that Fxy {x\y) is jointly continuous in x, y over supp(X, Y). The normalized 
posterior matching kernel is therefore given by 

FeiM) = Px\Y{Fx\d)\Fy'm 

and is jointly continuous in 9, (f> over supp(8, $) (note that for the family this does not hold in general). Thus, 
property (^413*) implies (by continuity) that for any 9 G (0, 1) there exists some ((> e (0, 1) so that FQ\^{9\(f)) ^ 9^ 

We first show that the chain is $P_{\Theta\Phi}$-irreducible. Let $\mathfrak{B}$ here denote the usual Borel $\sigma$-algebra corresponding to the open unit interval. Since $\Theta_n$ is a deterministic function of $(\Theta_{n-1}, \Phi_{n-1})$, and since $\Phi_n$ is generated from $\Theta_n$ via a memoryless channel, it follows (by arguments similar to those given in the proof of Lemma IV.4) that to establish irreducibility it suffices to consider only the $\Theta_n$ component of the chain, and (since $P_{\Theta\Phi}$ has a proper p.d.f.) to show that any set $A \in \mathfrak{B}$ with $U(A) > 0$ is reached in a finite time with a positive probability starting from any fixed message point $\Theta_0 = \theta_0 \in (0,1)$.

Define the set mapping $\pi : \mathfrak{B} \to \mathfrak{B}$

$$\pi(A) \triangleq \{\theta \in (0,1) : \theta = F_{\Theta|\Phi}(\theta'|\phi),\; \theta' \in A,\; \phi \in \mathrm{supp}(\Phi \mid \Theta = \theta')\}$$

namely, the set of all points that are "reachable" from the set $A$ in a single iteration. If $A$ is an interval (or a single point), then $\pi(A)$ is also an interval, since it is a continuous image of the set $A' = \mathrm{supp}(\Theta, \Phi) \cap \{A \times (0,1)\}$, which by property (A5) is a connected set. For any $\theta_0 \in (0,1)$ it holds that $E\left(F_{\Theta|\Phi}(\theta_0|\Phi)\right) = \theta_0$, and together with property (A3*) it must also be that

$$\inf_{\phi \in \mathrm{supp}(\Phi|\Theta=\theta_0)} F_{\Theta|\Phi}(\theta_0|\phi) < \theta_0 < \sup_{\phi \in \mathrm{supp}(\Phi|\Theta=\theta_0)} F_{\Theta|\Phi}(\theta_0|\phi) \tag{73}$$

Thus, $\theta_0$ is an interior point of the interval $\pi(\{\theta_0\})$. The arguments above regarding $\pi$ can be applied to all points within the set $\pi(\{\theta_0\})$, and then recursively to obtain

$$\theta_0 \in \pi(\{\theta_0\}) \subset \pi^{(2)}(\{\theta_0\}) \subset \ldots \subset \pi^{(n)}(\{\theta_0\}) \subset \ldots \tag{74}$$

where $\pi^{(n)}$ is the $n$-fold iteration of $\pi$. Therefore, $\{\pi^{(n)}(\{\theta_0\})\}_{n=1}^{\infty}$ is a sequence of expanding intervals containing $\theta_0$ as an interior point. Note also that $\pi^{(n)}(\{\theta_0\}) = \mathrm{supp}(\Theta_n \mid \Theta_0 = \theta_0)$. Consider the set

$$A_{\theta_0} \triangleq \bigcup_{n=0}^{\infty} \pi^{(n)}(\{\theta_0\})$$

Let us show that $A_{\theta_0} = (0,1)$. First, it is easy to see that $A_{\theta_0}$ is an open interval, since it is a union of nested intervals, and if it had contained one of its endpoints then that endpoint would have been contained in $\pi^{(n)}(\{\theta_0\})$ for some $n$, which by the expansion property above is an interior point of $\pi^{(n+1)}(\{\theta_0\}) \subset A_{\theta_0}$, in contradiction. Now, suppose that $A_{\theta_0} = (\theta_1, \theta_2)$ for $\theta_1 > 0$. Using (73) and the continuity of $F_{\Theta|\Phi}(\theta|\phi)$ once again, we have

$$\lim_{\theta \to \theta_1^+} \inf_{\phi \in \mathrm{supp}(\Phi|\Theta=\theta)} F_{\Theta|\Phi}(\theta|\phi) = \inf_{\phi \in \mathrm{supp}(\Phi|\Theta=\theta_1)} F_{\Theta|\Phi}(\theta_1|\phi) < \theta_1$$

in contradiction. The same argument applies for $\theta_2$, establishing $A_{\theta_0} = (0,1)$. As a result, for any set $A \in \mathfrak{B}$ with $U(A) > 0$ we have that $U(A \cap \pi^{(n)}(\{\theta_0\})) \to U(A)$ as $n \to \infty$. Therefore, there exists a finite $n$ for which $U(A \cap \pi^{(n)}(\{\theta_0\})) > 0$, and since $U \ll P_{\Theta_n|\Theta_0}$ when restricted to $\pi^{(n)}(\{\theta_0\})$, it must be that $P_{\Theta_n|\Theta_0}(A|\theta_0) > 0$. Thus, the normalized chain is $P_{\Theta\Phi}$-irreducible. It was already verified that $P_{\Theta\Phi}$ is an invariant distribution, hence by Lemma II.4 the chain is also recurrent, $P_{\Theta\Phi}$ is unique and ergodic, and so property (A2) holds.

^^Note that continuity also implies there is an interval for which this holds, and since $ ~ W, the stronger property (^4(3} holds. 

^^This is proved as follows: Since Fx, Fy are continuous, the set supp(0, <3>) inherits the properties of supp(X, Y), namely it is con- 
nected (and open, hence path-connected) and convex in the </>-direction. Therefore, any two points in a, 6 € A' can be connected by a path in 
supp(0, <3>). If this path does not lie entirely in A' , then consider a new path that starts from a in a straight line connecting to the last point in the 
original path which has the same 6 coordinate as a, then merges with the original path until reaching the first point with the same 6 coordinate as 
b, and continuing in a straight hne to b. Since supp(0, <E>) is convex in the 0-direction this new path is completely within A' . 

Let V denote the stochastic kernel of our Markov chain. To establish p.h.r., we would like to use condition (O of 
Lemma HO] However, 0„+i is a deterministic function of (6n, thus (j>)) Pe* (as the former 

is supported on a Pe$-iiull set). Nevertheless, it is easy to see that due to the expansion property, the 2-skeleton 
of the chain (which is also recurrent with the same invariant distribution) admits a proper p.d.f. over a subset of 
supp(9, $) and therefore V^{-\{9, 0)) «; Pe* for any {9, (jy) e supp(6, $). Thus, by condition ^ of Lemma HOI 
the 2-skeleton is p.h.r., which in turn implies the chain itself is p.h.r. via condition ^ of Lemma HO] 

To establish aperiodicity, we use the expansion property ( l74l ) once again. Suppose the chain has period d> 1 and 
let {DiYlllQ be the corresponding partition of the state space supp(0, $). From our previous discussion we already 
know that for any (6'o, 0o)- the set supp(8n|6o = 6*0, $o = '/'o) is an interval that expands into (0, 1) as n — > oo. 
Since we have the Markov relation $„ — 0„ — the set supp(9n, ^nlQo = 6*0, = 4>o) expands 

into supp(8, $) in the sense that it contains any open subset of supp(8, $) for any n large enough. Therefore, by 
definition of periodicity for any n S IN and i e {0, . . . , d — 1} we have P((0„d+i, '^nd+i) G Di|(6o, $o) G ^o) = 
1, and since Pe* <^ U x U , then it must be that {U x U) (supp(9, $)\Di) = for any i e {0, 1, . . . , d - 1}. 
However, this cannot be satisfied by d > 1 disjoint sets. 

□ 



Lemma B.l. Suppose (Px,Py\x) G ilc- Then Lemmas \V3\ and \V.4\ hold for any fixed messase point 0n — Oq £ 
(0, 1). Furthermore, for any e > 0,6 > and 9q € (0, 1)." 

lim P("ef, > e|eo = Oo) = lim P(+ef, < 1 - e|eo = Oq) = 



Proof. The proofs of Lemmas V.3 and V.4 remain virtually the same, only now using the SLLN for p.h.r. chains (Lemma II.7) to obtain convergence for any fixed message point.

Since by Lemma llV.3l the normalized chain is p.h.r and aperiodic. Lemma HO] guarantees that the marginal 
distribution converges to the invariant distribution Pe$ in total variation, for any initial condition and hence any 
fixed message point. Loosely speaking, we prove the result by reducing the fixed message point setting for large 
enough n, to the already analyzed case of a uniform message point in Lemma [V2] 

First, let {$n}i^Li be a sequence of r.v.'s such that Pj tends to U in total variation. Then the result of Lemma 
IV. 1 l ean be rewritten as _ x 

lim E(v.[Pe|*(-|*n)°/i]) < Ci^.ih)) (75) 

which holds since the expectation is taken over a bounded function. 

Now, consider the /c-fold chain ^n~^'^~'^}^^i for some fixed k. It is immediately seen that this chain is 

also p.h.r., and its invariant distribution is Pe$, the fc-fold cartesian product of Pe*. Thus, by Lemma HO] the fc-fold 
chain approaches this invariant distribution in total variation for any initial condition. In particular, this implies that 

lim dTy(P$.+.-i|e„(-|eo),Z^^) = 

where U'^ is the fc-fold cartesian product of U. Namely, the distribution of k consecutive outputs tends to i.i.d. 
uniform in total variation. Using dTSt and a trivial modification of Lemma HO] for an asymptotically i.i.d. control 
sequence, we have that for any fixed k 

lim P (Va(G„(0)) >iy\eo^Oo)<- r{k) (76) 

n— ^oo p 

where r(-) is the decay profile of ^. Let be the smallest integer such that for any n> Uk 

1 



p (^.(G„(0)) >i^\eo^ Oo) < - ^/Hk) 

holds, which must exist by ( l76] l. Thus, 



1 



lim P (V'.(G„, m >y\Qo = Oo) < - Imi VK^) = 

fe— >oo ly fc— S-OO 

Now, the proof of the Lemma follows through by working with (fc, n^) in lieu of n, and in ( l43] l using the fact that 
the distribution of Gn (Oo) tends to U in total variation. □ 
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B POINTWISE ACHIEVABILITY PROOFS 



Proof of Theorem VII.2. Let us first make the distinction between the Markov chain generated by the posterior matching scheme for $(P_X, P_{Y|X})$ when operating over the channel $P_{Y|X}$, according to whose law the transmitter and receiver encode/decode, and the chain generated by the same scheme when operating over the channel $P_{Y^*|X^*}$, which describes what actually takes place during transmission. We refer to the former as the primary chain denoting its input/output sequence as usual by $(X_n, Y_n)$, and to the latter as the mismatch chain, denoting its input/output sequence by $(X_n^*, Y_n^*)$. The same monikers and notations are used for the normalized counterparts.

Property (CHJ guarantees that the expansion property holds for the mismatch chain, and since by Property {dQ^ $P_{X^*Y^*}$ is an invariant distribution, a similar derivation as in Lemma H V. 3 1 implies that the mismatch chain is p.h.r., which in particular also guarantees the uniqueness of $P_{X^*Y^*}$. We would now like to obtain an analogue of Lemma IV. 3 1. Let us expand the posterior p.d.f. w.r.t. the primary chain, using the fact that it induces an i.i.d. output distribution (this does not necessarily hold for the mismatch chain) and that the channel is memoryless.

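$$f_{\Theta_0|Y^n}(\theta|y^n) = \frac{f_{Y_n|\Theta_0,Y^{n-1}}(y_n|\theta,y^{n-1})}{f_{Y_n|Y^{n-1}}(y_n|y^{n-1})}\, f_{\Theta_0|Y^{n-1}}(\theta|y^{n-1}) = \frac{f_{Y|X}(y_n|g_n(\theta,y^{n-1}))}{f_Y(y_n)}\, f_{\Theta_0|Y^{n-1}}(\theta|y^{n-1})$$

Applying the recursion rule $n$ times, taking a logarithm and evaluating the above at the message point, we obtain

$$\frac{1}{n}\log f_{\Theta_0|Y^n}(\theta_0|y^n) = \frac{1}{n}\sum_{k=1}^{n}\log\frac{f_{Y|X}(y_k|g_k(\theta_0,y^{k-1}))}{f_Y(y_k)}$$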

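Now we can evaluate this posterior of the primary chain using the inputs/outputs of the mismatch chain, and apply the p.h.r. SLLN (Lemma [V.3b) for the mismatch chain using its invariant distribution $P_{X^*Y^*}$:

$$\begin{aligned}
\lim_{n\to\infty}\frac{1}{n}\log f_{\Theta_0|Y^n}(\Theta_0|Y^{*n})
&= \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}\log\frac{f_{Y|X}(Y_k^*|g_k(\Theta_0,Y^{*k-1}))}{f_Y(Y_k^*)}
\overset{(a)}{=} \lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}\log\frac{f_{Y|X}(Y_k^*|X_k^*)}{f_Y(Y_k^*)} \\
&= E\left(\log\frac{f_{Y|X}(Y^*|X^*)}{f_Y(Y^*)}\right) \\
&\overset{(b)}{=} E\left(\log\frac{f_{Y^*|X^*}(Y^*|X^*)}{f_{Y^*}(Y^*)} - \log\frac{f_{Y^*|X^*}(Y^*|X^*)}{f_{Y|X}(Y^*|X^*)} + \log\frac{f_{Y^*}(Y^*)}{f_Y(Y^*)}\right) \\
&= I(X^*;Y^*) - \left(D(P_{Y^*|X^*}\,\|\,P_{Y|X}\,|\,P_{X^*}) - D(P_{Y^*}\,\|\,P_Y)\right) \\
&\triangleq R^{\mathrm{mis}}(X,Y;X^*,Y^*)
\end{aligned}$$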

where in (a) we used the definition of the channel input, and in (b) we used Property {d^ and the convexity of the relative entropy, which together guarantee that $D(P_{Y^*}\,\|\,P_Y) \le D(P_{Y^*|X^*}\,\|\,P_{Y|X}\,|\,P_{X^*}) < \infty$. The same analysis using normalized chains results in

$$\lim_{n\to\infty}\frac{1}{n}\log f_{\Theta_0|\Phi^n}(\Theta_0|\Phi^{*n}) = E\log f_{\Phi|\Theta}(\Phi^*|\Theta^*) = R^{\mathrm{mis}} \qquad P_{\Theta_0}\text{-a.s.}$$

where the last equality is due to the invertibility of the chain normalization, which is guaranteed by Property (A5).
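The limiting rate derived above, the mutual information of the mismatch chain minus the difference of the two relative entropies, can be sanity-checked on a finite-alphabet analogue. The following is a minimal sketch under stated assumptions: sums replace integrals, the design and actual channels are binary symmetric with crossover probabilities 0.10 and 0.15, both input laws are uniform, and all function names are hypothetical. None of these choices is taken from the paper.

```python
import math

def output_marginal(p_x, channel):
    """Output pmf induced by the input pmf p_x over channel[x][y]."""
    return [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
            for y in range(len(channel[0]))]

def mismatch_rate(p_x, design, p_x_star, actual):
    """Return (direct, decomposed, mutual) in bits for finite alphabets.

    direct     = E log2[ P_{Y|X}(Y*|X*) / P_Y(Y*) ] under the actual joint law
    decomposed = I(X*;Y*) - (D(P_{Y*|X*}||P_{Y|X}|P_{X*}) - D(P_{Y*}||P_Y))
    mutual     = I(X*;Y*)
    """
    p_y = output_marginal(p_x, design)            # design-model output law
    p_y_star = output_marginal(p_x_star, actual)  # actual output law
    direct = mutual = d_cond = 0.0
    for x, px in enumerate(p_x_star):
        for y, v in enumerate(actual[x]):
            w = px * v                            # joint mass of (X*, Y*)
            if w == 0.0:
                continue
            direct += w * math.log2(design[x][y] / p_y[y])
            mutual += w * math.log2(v / p_y_star[y])
            d_cond += w * math.log2(v / design[x][y])
    d_out = sum(q * math.log2(q / p_y[y])
                for y, q in enumerate(p_y_star) if q > 0.0)
    return direct, mutual - (d_cond - d_out), mutual

def bsc(p):
    """Transition matrix of a binary symmetric channel (illustrative only)."""
    return [[1.0 - p, p], [p, 1.0 - p]]

uniform = [0.5, 0.5]
direct, decomposed, mutual = mismatch_rate(uniform, bsc(0.10), uniform, bsc(0.15))
print(f"direct={direct:.4f} decomposed={decomposed:.4f} I(X*;Y*)={mutual:.4f}")
```

In this toy setting the two forms of the rate agree, and the rate does not exceed the mutual information of the actual channel, in line with the inequality between the two relative entropies noted under (b).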
Now we can define the analogue of in Wt\ as follows: 

$$R^{\mathrm{mis}}_{\varepsilon} \triangleq E\log f^{-}_{\Phi|\Theta}(\Phi^*|\Theta^*)$$

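Therefore,

$$0 \le R^{\mathrm{mis}} - R^{\mathrm{mis}}_{\varepsilon} = D\left(P_{\Phi^*|\Theta^*}\,\|\,P^{-}_{\Phi|\Theta}\,|\,P_{\Theta^*}\right) - D\left(P_{\Phi^*|\Theta^*}\,\|\,P_{\Phi|\Theta}\,|\,P_{\Theta^*}\right)$$

The second term on the right-hand side above is finite due to Property ((I©, and by Property (C[T]) we have that $\inf_{\varepsilon>0} D\left(P_{\Phi^*|\Theta^*}\,\|\,P^{-}_{\Phi|\Theta}\,|\,P_{\Theta^*}\right) < \infty$. Thus, for any $\varepsilon$ small enough

$$-\infty < R^{\mathrm{mis}}_{\varepsilon} \le R^{\mathrm{mis}}$$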

We can now continue as in the proof of Lemma (IV. 4b to show that (46) holds in this case for any rate $R < R^{\mathrm{mis}}$.

The contraction Property (CHli implies the equivalent of Lemma [V2l for the mismatch chain, since although the output sequence $Y^*$ is not necessarily i.i.d. even when we start in the invariant distribution, we have a contraction uniformly given any conditioning. Tying this together with the above and repeating the last steps of Theorem lV.il, the achievability of (56) is established. □



C Miscellaneous Proofs 

Proof of Lemma |7/!2l. For simplicity we assume that $f_X$ is symmetric around its maximum; the general unimodal case follows in essentially the same way. Since the property of having a regular tail is shift invariant, we can further assume without loss of generality that $f_X$ attains its maximum at (and is symmetric around) $x = 0$.

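(i) By the assumption, there exist $m_0, m_1 > 0$, $b > a > 1$ and $x_0 > 1$ so that for any $|x| > x_0$

$$m_0|x|^{-b} \le f_X(x) \le m_1|x|^{-a}$$

Thus, for any $x > x_0$

$$1 - F_X(x) \le m_1\int_x^\infty y^{-a}\,dy = \frac{m_1}{a-1}\,x^{1-a} \le \frac{m_1 m_0^{\frac{1-a}{b}}}{a-1}\, f_X^{\frac{a-1}{b}}(x)$$

and similarly

$$1 - F_X(x) \ge \frac{m_0 m_1^{\frac{1-b}{a}}}{b-1}\, f_X^{\frac{b-1}{a}}(x)$$

Identical derivations hold for $F_X(x)$ and $x < -x_0$, and thus setting $\gamma = 1 - F_X(x_0)$ the tail regularity is established.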
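The two polynomial-tail bounds of part (i) can be checked numerically on a concrete density. The sketch below is an illustration under stated assumptions: it uses the standard Cauchy density, which satisfies $m_0 x^{-b} \le f_X(x) \le m_1 x^{-a}$ for $x > 1$ with the illustrative constants $m_0 = 1/(2\pi)$, $m_1 = 1/\pi$, $a = 2$, $b = 3$. The density, the constants and the function names are chosen here for the example and do not come from the paper.

```python
import math

def cauchy_pdf(x):
    """Standard Cauchy density, used here only as a test case."""
    return 1.0 / (math.pi * (1.0 + x * x))

def cauchy_upper_tail(x):
    """1 - F_X(x) for x > 0, written via atan(1/x) to avoid cancellation."""
    return math.atan(1.0 / x) / math.pi

# Illustrative constants (assumptions): m0 * x**-b <= f(x) <= m1 * x**-a for x > 1.
M0, M1, A, B = 1.0 / (2.0 * math.pi), 1.0 / math.pi, 2.0, 3.0

def polynomial_tail_bounds(f_x, m0=M0, m1=M1, a=A, b=B):
    """Lower and upper bounds on 1 - F_X(x), expressed through f_X(x)."""
    lower = m0 * m1 ** ((1.0 - b) / a) / (b - 1.0) * f_x ** ((b - 1.0) / a)
    upper = m1 * m0 ** ((1.0 - a) / b) / (a - 1.0) * f_x ** ((a - 1.0) / b)
    return lower, upper

def check_cauchy(xs):
    """True if the density assumption and both tail bounds hold on every x."""
    for x in xs:
        f_x = cauchy_pdf(x)
        if not (M0 * x ** -B <= f_x <= M1 * x ** -A):
            return False
        lower, upper = polynomial_tail_bounds(f_x)
        if not (lower <= cauchy_upper_tail(x) <= upper):
            return False
    return True

grid = [1.01 * 1.5 ** k for k in range(25)]   # from about 1 to about 1.7e4
print(check_cauchy(grid))
```

The check only probes a finite grid, so it is a plausibility test of the inequalities rather than a proof.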

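(ii) By the assumption, there exist $0 < m_0 < m_1$, $a > 1$, $b > 0$ and $x_0 > 1$ so that for any $|x| > x_0$

$$m_0 e^{-b|x|^a} \le f_X(x) \le m_1 e^{-b|x|^a}$$

Thus, for any $x > x_0$

$$1 - F_X(x) \le m_1\int_x^\infty e^{-by^a}\,dy \le m_1\int_x^\infty \left(\frac{y}{x}\right)^{a-1} e^{-by^a}\,dy \le \frac{m_1}{a}\int_{x^a}^\infty e^{-bz}\,dz = \frac{m_1}{ab}\,e^{-bx^a} \le \frac{m_1}{m_0 ab}\, f_X(x)$$

and on the other hand

$$\begin{aligned}
(1 - F_X(x))\left(ab + (a-1)x^{-a}\right) &\ge m_0\int_x^\infty \left(ab + (a-1)x^{-a}\right)e^{-by^a}\,dy \\
&\ge m_0\int_x^\infty \left(ab + (a-1)y^{-a}\right)e^{-by^a}\,dy \overset{(a)}{=} -m_0\left.\frac{e^{-by^a}}{y^{a-1}}\right|_x^\infty
\end{aligned}$$

where (a) is easily verified by differentiation. Thus for any $x > x_0$

$$1 - F_X(x) \ge \frac{m_0\,x}{abx^a + a - 1}\, e^{-bx^a} \ge m_0 m_1^{-\beta/b} f_X^{\beta/b}(x)$$

where the last inequality holds for $x > x_0$ with suitable selection of $\beta > b$. Identical derivations hold for $F_X(x)$ and $x < -x_0$, and thus setting $\gamma = 1 - F_X(x_0)$ the tail regularity is established.

□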
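The exponential-tail argument of part (ii) rests on an antiderivative identity and on two bounds for $1 - F_X(x)$. The sketch below checks both numerically for a standard Gaussian density, which has an exponential tail with $a = 2$ and $b = 1/2$. The choice of density, the constants $m_0$ and $m_1$, the evaluation grid and the function names are assumptions made for this illustration only.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)
A, B = 2.0, 0.5                               # exponent parameters of the tail
M0, M1 = 0.5 / SQRT_2PI, 1.0 / SQRT_2PI       # illustrative constants, m0 < m1

def gauss_pdf(x):
    return math.exp(-0.5 * x * x) / SQRT_2PI

def gauss_upper_tail(x):
    """1 - F_X(x) for the standard Gaussian."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def integrand(y, a=A, b=B):
    """(ab + (a-1) y^-a) exp(-b y^a), the integrand in step (a)."""
    return (a * b + (a - 1.0) * y ** -a) * math.exp(-b * y ** a)

def antiderivative(y, a=A, b=B):
    """-exp(-b y^a) / y^(a-1), whose derivative should equal integrand(y)."""
    return -math.exp(-b * y ** a) * y ** (1.0 - a)

def exponential_tail_bounds(x, m0=M0, m1=M1, a=A, b=B):
    """Lower and upper bounds on 1 - F_X(x) from the derivation, for x > 1."""
    lower = m0 * x / (a * b * x ** a + a - 1.0) * math.exp(-b * x ** a)
    upper = m1 / (m0 * a * b) * gauss_pdf(x)
    return lower, upper

def check_gaussian(xs):
    """True if both tail bounds hold at every grid point."""
    return all(lo <= gauss_upper_tail(x) <= hi
               for x in xs
               for lo, hi in [exponential_tail_bounds(x)])

grid = [1.05 + 0.25 * k for k in range(20)]
print(check_gaussian(grid))
```

A central finite difference of `antiderivative` can be compared against `integrand` to confirm step (a) at any chosen point.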

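Proof of LemmaWH (i) Let $0 < M = \inf_{\mathrm{supp}(\Theta,\Phi)} f_{\Phi|\Theta}(\phi|\theta)$. Since $\mathrm{supp}(\Theta,\Phi)$ is convex in the $\theta$-direction and $f_{\Phi|\Theta}(\phi|\theta) = f_{\Theta|\Phi}(\theta|\phi)$, we have that $f^{-}_{\Phi|\Theta}(\phi|\theta) \ge M$ over $\mathrm{supp}(\Theta,\Phi)$. Therefore:

$$\begin{aligned}
0 \le D\left(P_{\Phi|\Theta}\,\|\,P^{-}_{\Phi|\Theta}\,|\,P_\Theta\right)
&= \iint_{\mathrm{supp}(\Theta,\Phi)} f_{\Phi|\Theta}(\phi|\theta)\log\frac{f_{\Phi|\Theta}(\phi|\theta)}{f^{-}_{\Phi|\Theta}(\phi|\theta)}\,d\theta\,d\phi \\
&\le \iint_{\mathrm{supp}(\Theta,\Phi)} f_{\Phi|\Theta}(\phi|\theta)\log\frac{f_{\Phi|\Theta}(\phi|\theta)}{M}\,d\theta\,d\phi = -\left(h(\Phi|\Theta) + \log M\right) < \infty
\end{aligned}$$

where in the last inequality we used the finiteness of the joint entropy and $\Theta \sim \mathcal{U}$. The same holds for $D\left(P_{\Phi|\Theta}\,\|\,P^{+}_{\Phi|\Theta}\,|\,P_\Theta\right)$, concluding the proof.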
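(ii) Since both $F_X, F_Y$ are now bijective, we have that

$$F_{\Phi|\Theta}(\phi|\theta) = F_{Y|X}\left(F_Y^{-1}(\phi)\,\middle|\,F_X^{-1}(\theta)\right)$$

and thus

$$f_{\Phi|\Theta}(\phi|\theta) = \frac{\partial}{\partial\phi}F_{Y|X}\left(F_Y^{-1}(\phi)\,\middle|\,F_X^{-1}(\theta)\right) = \frac{f_{Y|X}\left(F_Y^{-1}(\phi)\,|\,F_X^{-1}(\theta)\right)}{f_Y\left(F_Y^{-1}(\phi)\right)} = \frac{f_{X|Y}\left(F_X^{-1}(\theta)\,|\,F_Y^{-1}(\phi)\right)}{f_X\left(F_X^{-1}(\theta)\right)}$$

We can therefore write

$$f^{-}_{\Phi|\Theta}(\phi|\theta) \ge m^{-1}\inf_{\xi\in J^{-}(\phi,\theta)} f_{X|Y}\left(F_X^{-1}(\xi)\,|\,F_Y^{-1}(\phi)\right)$$

where $m = \sup f_X(x) < \infty$. Denote the max-to-min ratio bound by

$$M \triangleq \sup_{y\in\mathrm{supp}(Y)}\ \frac{\sup_{x\in\mathrm{supp}(X|Y=y)} f_{X|Y}(x|y)}{\inf_{x\in\mathrm{supp}(X|Y=y)} f_{X|Y}(x|y)}$$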
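The relative entropy $D\left(P_{\Phi|\Theta}\,\|\,P^{-}_{\Phi|\Theta}\,|\,P_\Theta\right)$ can be upper bounded as follows:

$$\begin{aligned}
D\left(P_{\Phi|\Theta}\,\|\,P^{-}_{\Phi|\Theta}\,|\,P_\Theta\right)
&\le \iint_{\mathrm{supp}(\Theta,\Phi)} \frac{f_{X|Y}\left(F_X^{-1}(\theta)|F_Y^{-1}(\phi)\right)}{f_X\left(F_X^{-1}(\theta)\right)} \log\frac{f_{X|Y}\left(F_X^{-1}(\theta)|F_Y^{-1}(\phi)\right)}{f_X\left(F_X^{-1}(\theta)\right)\cdot m^{-1}\cdot \inf_{\xi\in J^{-}(\phi,\theta)} f_{X|Y}\left(F_X^{-1}(\xi)|F_Y^{-1}(\phi)\right)}\,d\theta\,d\phi \\
&= \log(m) + \iint_{\mathrm{supp}(X,Y)} f_{X|Y}(x|y)\,f_Y(y)\log\left(\frac{1}{f_X(x)}\cdot\frac{f_{X|Y}(x|y)}{\inf_{z\in J^{-}(y,x)} f_{X|Y}(z|y)}\right)dx\,dy \\
&\le \log(m) + h(X) + \log M < \infty
\end{aligned} \tag{77}$$

where a straightforward change of variables was performed, and $J^{-}(y,x)$ is the counterpart of $J^{-}(\phi,\theta)$. In the last inequality we used the fact that $\mathrm{supp}(X,Y)$ is convex in the y-direction, which implies that $J^{-}(y,x) \subseteq \mathrm{supp}(X|Y=y)$. Furthermore, $h(X)$ is finite since $f_X$ is proper and bounded.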
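The expression for the normalized conditional density used in part (ii), the channel law divided by the output density or equivalently the posterior divided by the input density, can be checked on a Gaussian example. The sketch below is an illustration under stated assumptions: a Gaussian input of variance `power`, additive Gaussian noise of variance `noise_var`, and a midpoint rule for the integral over $\phi$. The example channel, parameter values and function names are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist

def normalized_kernel(phi, theta, power=1.0, noise_var=1.0):
    """f_{Phi|Theta}(phi|theta) = f_{Y|X}(y|x) / f_Y(y), x = F_X^-1(theta), y = F_Y^-1(phi)."""
    x_dist = NormalDist(0.0, math.sqrt(power))
    y_dist = NormalDist(0.0, math.sqrt(power + noise_var))
    x, y = x_dist.inv_cdf(theta), y_dist.inv_cdf(phi)
    return NormalDist(x, math.sqrt(noise_var)).pdf(y) / y_dist.pdf(y)

def normalized_kernel_bayes(phi, theta, power=1.0, noise_var=1.0):
    """Same quantity written as f_{X|Y}(x|y) / f_X(x)."""
    x_dist = NormalDist(0.0, math.sqrt(power))
    y_dist = NormalDist(0.0, math.sqrt(power + noise_var))
    x, y = x_dist.inv_cdf(theta), y_dist.inv_cdf(phi)
    total = power + noise_var
    posterior = NormalDist(power * y / total, math.sqrt(power * noise_var / total))
    return posterior.pdf(x) / x_dist.pdf(x)

def integrate_over_phi(theta, steps=4000):
    """Midpoint-rule integral of the kernel over phi in (0, 1)."""
    return sum(normalized_kernel((i + 0.5) / steps, theta)
               for i in range(steps)) / steps

print(normalized_kernel(0.4, 0.3), normalized_kernel_bayes(0.4, 0.3))
print(integrate_over_phi(0.3))
```

Both forms should agree up to rounding, and the integral over $\phi$ should be close to one, since the kernel is a conditional density on the unit interval.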

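(iii) We prove the claim under the simplifying assumption that $f_{X|Y}(\cdot|y)$ is also symmetric for any fixed $y$. The argument for the general claim is a similar yet more tedious version of this proof. We need the following lemma:

Lemma C.1. Suppose $X$ is proper with a symmetric unimodal p.d.f., a finite variance $\sigma^2$, and a regular tail with parameters $\gamma, c_i, \alpha_i$. Define

$$f_X^{-}(x) \triangleq \inf_{\xi\in(e(x),\,x)} f_X(\xi), \qquad e(x) \triangleq F_X^{-1}\left(\tfrac{1}{2}F_X(x)\right)$$

and let

$$\gamma^* = \min\left(\gamma, \tfrac{1}{2}\right), \qquad M = \sup f_X(x), \qquad M_1 = \frac{\gamma^*}{4}\left(\sigma\sqrt{\frac{2}{\gamma^*}} - \frac{1-\gamma^*}{2M}\right)^{-1}$$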



Then 



DifxWrx) < «r' log ^ + (1 + ai - ao) logM + log Afi 

Co 



Proof. Without loss of generality we can assume that $\gamma < \tfrac{1}{2}$, since a larger value implies a regular tail for any smaller value. Define $x_2 < x_1 < x_0 < 0$ to be

$$x_0 = F_X^{-1}(\gamma), \qquad x_1 = F_X^{-1}\left(\tfrac{\gamma}{2}\right), \qquad x_2 = F_X^{-1}\left(\tfrac{\gamma}{4}\right)$$

It is easy to see that $e(x_0) = x_1$ and $e(x_1) = x_2$. Defining $M = \sup f_X(x)$ we can lower bound $|x_1|$ using symmetry:

$$2|x_1|M \ge 1 - \gamma \quad\Rightarrow\quad |x_1| \ge \frac{1-\gamma}{2M}$$

Using Chebyshev's inequality and symmetry, we can upper bound $|x_2|$ by

$$2\int_{|x_2|}^{\infty} f_X(x)\,dx = \frac{\gamma}{2} \le \frac{\sigma^2}{x_2^2} \quad\Rightarrow\quad |x_2| \le \sigma\sqrt{\frac{2}{\gamma}}$$

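Combining the above and using the monotonicity of $f_X$ for $x < 0$, we have

$$f_X(x_1)\cdot\left(|x_2| - |x_1|\right) \ge \frac{\gamma}{4}$$

which yields a lower bound for $f_X(x_1)$:

$$f_X(x_1) \ge \frac{\gamma}{4}\left(\sigma\sqrt{\frac{2}{\gamma}} - \frac{1-\gamma}{2M}\right)^{-1} = M_1$$

and since $f_X$ is symmetric and unimodal and by the assumption $\gamma < \tfrac{1}{2}$, it is readily verified that

$$\begin{aligned}
f_X^{-}(x) &\ge f_X(e(x)) & x &\in (-\infty, x_0) \\
f_X^{-}(x) &\ge f_X(x_1) \ge M_1 & x &\in (x_0, |x_0|) \\
f_X^{-}(x) &= f_X(x) & x &\in (|x_0|, \infty)
\end{aligned} \tag{78}$$

Now, recall that $f_X$ has a regular tail, which in this symmetric case means that (recall that $x_0 < 0$)

$$c_0 f_X^{\alpha_0}(x) \le F_X(x) \le c_1 f_X^{\alpha_1}(x), \qquad |x| > |x_0|$$

Let us upper bound the relative entropy between $f_X$ and $f_X^{-}$ using the above together with (78):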

Difx\\f*x)= r fx{x)log^dx+ f^°^ fx(x)\og^dx+ r f^(x)log^dx 

J-oc Ix\^/ Jxo JX\^) Jlxol Jx\^) 



< / fxix)\og dx+ / fx{x)log — dx + / fx{x)logldx 

< a7^ log — - + (1 + ai — ao) log M — log Mi < 00 

Co 



□ 
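The geometric steps in the proof of Lemma C.1 (the three quantile points, the symmetry bound on $|x_1|$, the Chebyshev bound on $|x_2|$, and the resulting lower bound on $f_X(x_1)$) can be checked numerically for a zero-mean Gaussian. The sketch below is an illustration under stated assumptions: the Gaussian test density, the tested values of $\gamma$, the dictionary keys and the function name are hypothetical, and `m1` is the lower bound obtained by combining the three inequalities of the proof, which is a reading of the partly illegible definition of $M_1$.

```python
import math
from statistics import NormalDist

def lemma_c1_quantities(gamma, sigma=1.0):
    """Quantities from the proof steps, for a zero-mean Gaussian (assumption)."""
    dist = NormalDist(0.0, sigma)
    peak = dist.pdf(0.0)                                 # M = sup f_X
    x0, x1, x2 = (dist.inv_cdf(gamma / d) for d in (1.0, 2.0, 4.0))

    def e(x):
        """e(x) = F_X^-1(F_X(x) / 2)."""
        return dist.inv_cdf(0.5 * dist.cdf(x))

    x1_lower = (1.0 - gamma) / (2.0 * peak)              # symmetry bound on |x1|
    x2_upper = sigma * math.sqrt(2.0 / gamma)            # Chebyshev bound on |x2|
    m1 = (gamma / 4.0) / (x2_upper - x1_lower)           # combined lower bound
    return {
        "x0": x0, "x1": x1, "x2": x2,
        "e_x0": e(x0), "e_x1": e(x1),
        "x1_lower": x1_lower, "x2_upper": x2_upper,
        "f_x1": dist.pdf(x1), "m1": m1,
    }

q = lemma_c1_quantities(0.2)
print(q["x0"], q["x1"], q["x2"])
print(q["f_x1"], q["m1"])
```

For each tested $\gamma$ the density at $x_1$ should exceed `m1`, and $e(x_0)$, $e(x_1)$ should coincide with $x_1$, $x_2$ up to rounding.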



Returning to the pursued claim, let $\gamma, c_i, \alpha_i$ be the common tail parameters of $f_{X|Y}(\cdot|y)$, let $M = \sup f_{X|Y}(x|y)$ and let $\sigma^2$ be an upper bound on the variance of $f_{X|Y}(\cdot|y)$ for all $y$. It follows from the definition that for any $y$

$$\inf_{z\in J^{-}(y,x)} f_{X|Y}(z|y) \ge f^{-}_{X|Y}(x|y)$$

where $f^{-}_{X|Y}$ is defined as in Lemma C.1. We now follow the derivations of the previous claim (ii) up to (77), and use the above inequality and Lemma C.1 to obtain:



D{P^^e II "^'lie I ^e) < log(m) + hiX) + f fy{y)dy f fx\Y{x\y)\o: 

Jsupp(Y) JsuDp(XIY=v~l 



The same proof holds for Z?(P$|e || PLq I Pe) 



.fx\Y{x\y) 

supp(Y) "'supp(X|Y=y) Jx\Yy™) 

log (m) + h{x) + [ fY{y)D{fx\Y{-\y) II rx\Yi-\y))dy 

supp(Y) 

log (m) + h[X) + a^^ log — + (1 + ai - ao) log A/ - logA/i < 00 

Co 
(iv) A direct consequence of dU.